We have a \(\text{Beta}(4, 4)\) prior, toss a coin ten times, and observe fewer than 3 heads (but we don’t know the exact number of heads).
\[p(\theta) \propto \theta^{4 - 1} (1 - \theta)^{4-1}\] \[p(y \lvert \theta) = {10 \choose 0} \theta^0 (1 - \theta)^{10} + {10 \choose 1} \theta^1 (1 - \theta)^9 + {10 \choose 2} \theta^2 (1 - \theta)^8\] \[= (1 - \theta)^{10} + 10\theta (1 - \theta)^9 + 45 \theta^2 (1 - \theta)^8\]
\[p(\theta \lvert y) = [(1 - \theta)^{10} + 10\theta (1 - \theta)^9 + 45 \theta^2 (1 - \theta)] \cdot \theta^3(1 - \theta)^3\] \[= \theta^3(1 - \theta)^{13} + 10 \theta^4(1 - \theta)^{12} + 45 \theta^5 (1 - \theta)^{11}\]
Plot the posterior:
theta <- seq(0, 1, 0.01)
post <- theta^3 * (1 - theta)^13 + 10 * theta^4 * (1 - theta)^12 +
45 * theta^5 * (1 - theta)^11
post <- post/sum(post)
plot(theta, post, type = 'l', col = 'steelblue',
axes = FALSE, lwd = 2,
xlab = expression(theta), ylab = 'Density')
axis(1, tick = FALSE)
box(bty = 'L', col = 'grey60')
We randomly select one of two coins, with equal probability. One coin has 0.6 probability of landing heads, the other has 0.4 probability of landing heads. We toss the coin two times and observe two tails. What’s the posterior expectation of the number of further tosses until we get a head?
\[P(C_1) = P(C_2) = \frac{1}{2}\] \[C_1: \; \theta = 0.6; \qquad C_2: \; \theta = 0.4\] \[p(y = 0 \lvert \theta) = \begin{cases} 0.6^0 (1 - 0.6)^2 = 0.4^2 & \text{if } \theta = 0.6 \\ 0.4^0 (1 - 0.4)^2 = 0.6^2 & \text{if } \theta = 0.4 \end{cases}\] \[p(\theta_i \lvert y) = \begin{cases} \frac{0.4^2}{0.4^2 + 0.6^2} \approx 0.308 & \text{for } \theta_i = 0.6 \\ \frac{0.6^2}{0.4^2 + 0.6^2} \approx 0.692 & \text{for } \theta_i = 0.4 \end{cases} \]
\[E(\tilde y \lvert y) = \sum_{i = 1}^2 E(\tilde y \lvert \theta_i) \, p(\theta_i \lvert y) = \sum_{i = 1}^2 p(\theta_i \lvert y) \sum_{j=1}^\infty j \cdot p(\tilde y = j \lvert \theta_i)\] \[= \frac{0.4^2}{0.4^2 + 0.6^2} \cdot \sum_{j=1}^\infty j \cdot 0.6 (1 - 0.6)^{j-1} + \frac{0.6^2}{0.4^2 + 0.6^2} \cdot \sum_{j=1}^\infty j \cdot 0.4 (1 - 0.4)^{j-1}\] \[= \frac{0.4^2}{0.4^2 + 0.6^2} \cdot \frac{1}{0.6} + \frac{0.6^2}{0.4^2 + 0.6^2} \cdot \frac{1}{0.4} \qquad \text{(mean of the Geometric distribution)}\]
0.4^2 / (0.4^2 + 0.6^2) * (1 / 0.6) +
0.6^2 / (0.4^2 + 0.6^2) * (1 / 0.4)
## [1] 2.24359
Let \(Y\) be the number of 6’s in 1000 rolls of a fair die.
Approximate \(p(y \lvert \theta)\) using a normal distribution:
\[E(Y) = np = 1000 \cdot \frac{1}{6} = 166.\overline{6}\] \[Var(Y) = np(1-p) = 1000 \cdot \frac{1}{6} \cdot \frac{5}{6} = 138.\overline{8}\]
mu <- 1000 * 1/6
sigma2 <- 1000 * 1/6 * 5/6
y <- 120:220
plot(y, dnorm(y, mu, sqrt(sigma2)),
type = 'l', col = 'steelblue',
axes = FALSE, lwd = 2,
xlab = 'y', ylab = 'Density')
axis(1, tick = FALSE)
box(bty = 'L', col = 'grey60')
Give approximate 5%, 25%, 50%, 75%, and 95% points for the distribution of \(y\):
qnorm(c(0.05, 0.25, 0.5, 0.75, 0.95), mu, sqrt(sigma2))
## [1] 147.2819 158.7177 166.6667 174.6156 186.0515
Again, we count the number of 6’s in 1000 tosses of a die that may or may not be fair with the following prior probabilities:
\[p \bigg(\theta = \frac{1}{12} \bigg) = \frac{1}{4}\] \[p \bigg(\theta = \frac{1}{6} \bigg) = \frac{1}{2}\] \[p \bigg(\theta = \frac{1}{4} \bigg) = \frac{1}{4}\]
Approximate the marginal distribution \(p(y)\) using a normal approximation for each mixture component:
\[p(y) = \frac{1}{4} \cdot p \bigg(y \lvert \theta = \frac{1}{12} \bigg) + \frac{1}{2} \cdot p \bigg(y \lvert \theta = \frac{1}{6} \bigg) + \frac{1}{4} \cdot p \bigg(y \lvert \theta = \frac{1}{4} \bigg)\]
y <- 0:400
theta <- c(1/12, 1/6, 1/4)
mu <- 1000 * theta
sigma2 <- 1000 * theta * (1 - theta)
post <- sapply(y, function(x) dnorm(x, mu, sqrt(sigma2)))
post <- colSums(post * c(1/4, 1/2, 1/4))
plot(y, post,
type = 'l', col = 'steelblue',
axes = FALSE, lwd = 2,
xlab = 'y', ylab = 'Density')
axis(1, tick = FALSE)
box(bty = 'L', col = 'grey60')
Give approximate 5%, 25%, 50%, 75%, and 95% points for the distribution of \(y\). We can make use of the fact that close to 1/4 of the probability mass will be in the first normal hump, 1/2 will be in the second normal hump, and 1/4 will be in the third normal hump (since the humps share little overlap):
q5 <- qnorm(0.2, mu[1], sqrt(sigma2[1]))
q25 <- mu[1] + (mu[2] - mu[1]) / 2
q50 <- qnorm(0.5, mu[2], sqrt(sigma2[2]))
q75 <- mu[2] + (mu[3] - mu[2]) / 2
q95 <- qnorm(0.8, mu[3], sqrt(sigma2[3]))
plot(y, post,
type = 'l', col = 'steelblue',
axes = FALSE, lwd = 2,
xlab = 'y', ylab = 'Density')
abline(v = c(q5, q25, q50, q75, q95),
col = 'grey60', lty = 'dashed')
axis(1, tick = FALSE)
box(bty = 'L', col = 'grey60')
Let \(y\) be number of heads in \(n\) spins of a coin with probability of heads \(\theta\).
Assuming a \(\text{Uniform}(0, 1)\) prior on \(\theta\), derive the prior predictive distribution:
\[p(y = k) = \int_{0}^1 p(y = k \lvert \theta) d \theta\] \[= \int_0^1 {n \choose k} \theta^k (1 - \theta)^{n - k} d \theta\] \[= {n \choose k} \int_0^1 \theta^k (1 - \theta)^{n - k} d \theta \qquad \text{(Beta integral)}\] \[= \frac{n!}{k! (n - k)!} \cdot \frac{\Gamma(k + 1) \Gamma(n - k + 1)}{\Gamma(n + 2)}\] \[= \frac{1}{n + 1}\]
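We can check this result numerically by integrating the Binomial likelihood against the uniform prior (a quick sketch using `integrate`; the choice \(n = 10\) and the values of \(k\) are arbitrary):

```r
# Prior predictive p(y = k) under a Uniform(0, 1) prior should equal
# 1 / (n + 1) for every k
n <- 10
p_y <- sapply(c(0, 3, 10), function(k)
  integrate(function(theta) dbinom(k, n, theta), 0, 1)$value)
stopifnot(all(abs(p_y - 1 / (n + 1)) < 1e-6))
```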
Assuming a \(\text{Beta}(\alpha, \beta)\) prior, show that the posterior mean lies between the prior mean \(\frac{\alpha}{\alpha + \beta}\) and the sample mean/MLE \(\frac{y}{n}\).
The mean of the Beta posterior is \(\frac{\alpha + y}{\alpha + \beta + n}\).
\[\frac{\alpha + y}{\alpha + \beta + n} = w \cdot \frac{\alpha}{\alpha + \beta} + (1 - w) \frac{y}{n}\] \[\implies \frac{\alpha + y}{\alpha + \beta + n} - \frac{y}{n} = w \bigg( \frac{\alpha}{\alpha + \beta} - \frac{y}{n} \bigg)\] \[\implies \frac{ \alpha n + ny - \alpha y - \beta y - ny}{n(\alpha + \beta + n)} = w \cdot \bigg( \frac{\alpha n - y (\alpha + \beta)}{n(\alpha + \beta)} \bigg)\] \[\implies \frac{ \alpha n - \alpha y - \beta y}{n(\alpha + \beta + n)} = w \cdot \bigg( \frac{\alpha n - \alpha y - \beta y}{n(\alpha + \beta)} \bigg)\] \[\implies w = \frac{\frac{\alpha n - \alpha y - \beta y}{n(\alpha + \beta + n)}}{\frac{\alpha n - \alpha y - \beta y}{n(\alpha + \beta)}}\] \[\implies w = \frac{\alpha + \beta}{\alpha + \beta + n}; \qquad (1 - w) = \frac{n}{\alpha + \beta + n}\] Since \(\alpha > 0\), \(\beta > 0\) and \(n \in \{1, 2, 3, \ldots \}\), \(w\) will always be between 0 and 1:
\[0 < w < 1; \qquad 0 < 1 - w < 1\]
Thus, the posterior mean will always lie between the prior mean and the sample mean.
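A quick numerical check of the weights (the values of \(\alpha\), \(\beta\), \(y\), \(n\) are arbitrary):

```r
# The posterior mean equals w * (prior mean) + (1 - w) * MLE
# with w = (a + b) / (a + b + n)
a <- 2; b <- 3; n <- 10; y <- 7
w <- (a + b) / (a + b + n)
post_mean <- (a + y) / (a + b + n)
stopifnot(abs(post_mean - (w * a / (a + b) + (1 - w) * y / n)) < 1e-12)
```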
Show that if the prior is uniform, the posterior variance will always be smaller than the prior variance.
The variance of a \(\text{Beta}(\alpha, \beta)\) distribution is: \(\frac{\alpha \beta}{(\alpha + \beta)^2 \cdot (\alpha + \beta + 1)}\).
Therefore, with a uniform \(\text{Beta}(1, 1)\) prior, the prior variance will be:
\[Var(\theta) = \frac{1 \cdot 1}{(1 + 1)^2 \cdot (1 + 1 + 1)} = \frac{1}{12}\] The posterior variance will be:
\[Var(\theta \lvert y) = \frac{(1 + y) \cdot (1 + n - y)}{(1 + y + 1 + n -y)^2 (1 + y + 1 + n - y + 1)} = \frac{(1 + y)(1 + n - y)}{(n + 2)^2 (n + 3)}\] Let’s take the extreme case where the data is the least informative it can be: \(n = 1\) and \(y = 0\) (we could also do \(y=1\)). Then the posterior variance will be:
\[Var(\theta \lvert y = 0) = \frac{(1 + 0) \cdot (1 + 1 - 0)}{(1 + 2)^2 (1 + 3)} = \frac{2}{36} = \frac{1}{18} < \frac{1}{12}\] The posterior variance will then be lower than the prior variance.
We can also re-factorize the posterior variance in the following way:
\[Var(\theta \lvert y) = \frac{(1 + y)(1 + n - y)}{(n + 2)^2 (n + 3)} = \bigg( \frac{1 + y}{n + 2} \bigg) \bigg( \frac{1 + n - y}{n + 2} \bigg) \bigg( \frac{1}{n+3} \bigg)\]
\(\bigg( \frac{1 + y}{n + 2} \bigg)\) and \(\bigg( \frac{1 + n - y}{n + 2} \bigg)\) sum to 1, so their product is at most \(\frac{1}{4}\) (attained when each equals \(\frac{1}{2}\)). Since \(n \geq 1\), \(\frac{1}{n + 3} \leq \frac{1}{4}\), so the posterior variance is at most \(\frac{1}{16} < \frac{1}{12}\).
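We can also verify the bound over a range of \(n\) and all possible \(y\):

```r
# Posterior variance under a Uniform (Beta(1, 1)) prior stays below the
# prior variance 1/12 for every n >= 1 and every y in 0:n
for (n in 1:50) {
  y <- 0:n
  post_var <- (1 + y) * (1 + n - y) / ((n + 2)^2 * (n + 3))
  stopifnot(all(post_var < 1 / 12))
}
```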
Give an example of \(\text{Beta}(\alpha, \beta)\) prior and data \(y\), \(n\), in which posterior variance is higher than prior variance.
beta_var <- function(a, b, n, y) {
# Variance of the Beta(a + y, b + n - y) posterior
((a + y) * (b + n - y)) / ((a + b + n)^2 * (a + b + n + 1))
}
c(beta_var(1, 5, 0, 0), beta_var(1, 5, 1, 1))
## [1] 0.01984127 0.02551020
c(beta_var(5, 1, 0, 0), beta_var(5, 1, 1, 0))
## [1] 0.01984127 0.02551020
c(beta_var(10, 1, 0, 0), beta_var(10, 1, 10, 1))
## [1] 0.006887052 0.011337869
In each case a small sample that surprises the prior (low \(y\) with a high prior mean, or vice versa) increases the posterior variance above the prior variance.
We have:
\[y_j \sim \text{Poisson}(10n_j \theta_j); \qquad \theta_j \sim \text{Gamma}(\alpha, \beta)\] We’re asked to find \(E(y_j)\) and \(Var(y_j)\). We can use the formulas for conditional mean and variance:
\[E(Y) = E(E(Y \lvert \theta))\] \[Var(Y) = Var(E(Y \lvert \theta)) + E(Var(Y \lvert \theta))\] Thus:
\[E(y_j) = E(E(y_j \lvert \theta_j)) = E(10n_j \theta_j) = 10n_j \frac{\alpha}{\beta}\] \[Var(y_j) = Var(E(y_j \lvert \theta_j)) + E(Var(y_j \lvert \theta_j))\] \[= Var(10n_j \theta_j) + E(10n_j \theta_j) = (10n_j)^2 \frac{\alpha}{\beta^2} + 10n_j \frac{\alpha}{\beta}\]
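A Monte Carlo check of these formulas (the values \(\alpha = 3\), \(\beta = 2\), \(n_j = 5\) are arbitrary):

```r
# Simulate from the Gamma mixing distribution, then from the Poisson,
# and compare sample moments with the analytic formulas
set.seed(1)
alpha <- 3; beta <- 2; n_j <- 5
theta_j <- rgamma(1e6, alpha, beta)
y_j <- rpois(1e6, 10 * n_j * theta_j)
m_analytic <- 10 * n_j * alpha / beta                                  # 75
v_analytic <- (10 * n_j)^2 * alpha / beta^2 + 10 * n_j * alpha / beta  # 1950
stopifnot(abs(mean(y_j) - m_analytic) < 1,
          abs(var(y_j) - v_analytic) / v_analytic < 0.05)
```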
For binomial likelihood, show that \(\frac{1}{\theta (1 - \theta)}\) is the uniform prior distribution for the natural parameter of the exponential family.
\[p(y \lvert \theta) \propto \theta^y (1- \theta)^{n - y}\] \[= exp(y \cdot log(\theta) + (n - y) \cdot log(1 - \theta))\] \[= exp \bigg(y \cdot log \bigg(\frac{\theta}{1 - \theta} \bigg) + n \cdot log(1 - \theta) \bigg)\] Thus:
\[\phi = \phi(\theta) = log \bigg( \frac{\theta}{1 - \theta} \bigg)\] \[\implies \theta = \theta(\phi) = \frac{e^{\phi}}{1 + e^{\phi}}\] \[\frac{\partial \theta}{\partial \phi} = \frac{e^{\phi}(1 + e^{\phi}) - e^{\phi} e^{\phi}}{(1 + e^{\phi})^2} = \frac{e^{\phi}}{(1 + e^{\phi})^2}\]
Thus, since \(p(\theta) = \frac{1}{\theta (1 - \theta)}\):
\[p(\phi) = p_{\theta}(\theta(\phi)) \cdot \bigg\lvert \frac{\partial \theta}{\partial \phi} \bigg\lvert = \frac{1}{\frac{e^{\phi}}{1 + e^{\phi}} \cdot (1 - \frac{e^{\phi}}{1 + e^{\phi}})} \cdot \frac{e^{\phi}}{(1 + e^{\phi})^2}\] \[= \frac{1}{\frac{e^{\phi}}{1 + e^{\phi}} \cdot \frac{1}{1 + e^{\phi}}} \cdot \frac{e^{\phi}}{(1 + e^{\phi})^2}\] \[= \frac{1}{\frac{e^{\phi}}{(1 + e^{\phi})^2}} \cdot \frac{e^{\phi}}{(1 + e^{\phi})^2} = 1\]
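A numeric check that the induced density really is flat (a sketch over a grid of \(\phi\) values):

```r
# p(phi) should equal 1 for every phi when p(theta) = 1 / (theta * (1 - theta))
phi <- seq(-5, 5, 0.5)
theta <- exp(phi) / (1 + exp(phi))
p_phi <- 1 / (theta * (1 - theta)) * exp(phi) / (1 + exp(phi))^2
stopifnot(all(abs(p_phi - 1) < 1e-10))
```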
Show that if \(y=0\) or \(y=n\), the resulting posterior is improper:
\[p(\theta \lvert y) = p(y \lvert \theta) p(\theta) \propto \theta^y (1 - \theta)^{n-y} \cdot \theta^{-1} (1 - \theta)^{-1}\] \[= \theta^{y - 1}(1 - \theta)^{n - y - 1}\]
For the distribution to be proper, it must integrate to a finite constant. For \(y = 0\):
\[\int_0^1 \theta^{0-1} (1 - \theta)^{n-0-1} d\theta = \int_0^1 \theta^{-1} (1 - \theta)^{n-1} d \theta = \infty\] (since the integrand will become infinite as \(\theta \to 0\))
Likewise, for \(y = n\):
\[\int_0^1 \theta^{n-1} (1 - \theta)^{n-n-1} d\theta = \int_0^1 \theta^{n-1} (1 - \theta)^{-1} d \theta = \infty\] (again, since the integrand will become infinite as \(\theta \to 1\))
The weights of a random sample of \(n\) students are recorded, resulting in a mean weight \(\bar y = 150\). Assume that weights in the population are distributed around an unknown mean \(\theta\) with standard deviation \(20\). Assume a \(\text{Normal}(180, 40^2)\) prior for \(\theta\).
Give posterior distribution for \(\theta\).
Since both the prior and the likelihood are normal, we know that the posterior will be normal with the following mean and variance:
\[E(\theta \lvert y) = \frac{\frac{n \bar y}{\sigma^2} + \frac{\mu_0}{\sigma_0^2}}{\frac{n}{\sigma^2} + \frac{1}{\sigma_0^2}} = \frac{\frac{150n}{20^2} + \frac{180}{40^2}}{\frac{n}{20^2} + \frac{1}{40^2}}\] \[Var(\theta \lvert y) = \frac{1}{\frac{n}{\sigma^2} + \frac{1}{\sigma_0^2}} = \frac{1}{\frac{n}{20^2} + \frac{1}{40^2}}\] Thus:
\[\theta \lvert y \sim \text{Normal} \Bigg( \frac{\frac{150n}{20^2} + \frac{180}{40^2}}{\frac{n}{20^2} + \frac{1}{40^2}} , \frac{1}{\frac{n}{20^2} + \frac{1}{40^2}}\Bigg)\]
Give a posterior predictive distribution for a new student \(\tilde y\).
The posterior predictive distribution will again be normal, since it is obtained by integrating a normal likelihood against a normal posterior:
\[p(\tilde y \lvert y) = \int p(\tilde y \lvert \theta) p( \theta \lvert y) d\theta\] \[E(\tilde y \lvert y) = E(E(\tilde y \lvert \theta, y) \lvert y) = E(\theta \lvert y) = \mu_n\] \[Var( \tilde y \lvert y) = Var(E(\tilde y \lvert \theta, y) \lvert y) + E(Var( \tilde y \lvert \theta, y) \lvert y) = Var(\theta \lvert y) + \sigma^2\] \[= \sigma_n^2 + \sigma^2\]
Therefore, the posterior predictive distribution will be:
\[\tilde y \sim \text{Normal} \Bigg( \frac{\frac{150n}{20^2} + \frac{180}{40^2}}{\frac{n}{20^2} + \frac{1}{40^2}} , \frac{1}{\frac{n}{20^2} + \frac{1}{40^2}} + 20^2 \Bigg)\]
For \(n=10\), give 95% credible interval for \(\theta\) & 95% posterior predictive interval for \(\tilde y\):
n <- 10
mu_theta <- (n / 20^2 * 150 + 1/40^2 * 180) / (n/20^2 + 1/40^2)
sigma2_theta <- 1 / (n/20^2 + 1/40^2)
sigma2_yt <- 1 / (n/20^2 + 1/40^2) + 20^2
qnorm(c(0.025, 0.975), mu_theta, sqrt(sigma2_theta))
## [1] 138.4879 162.9755
qnorm(c(0.025, 0.975), mu_theta, sqrt(sigma2_yt))
## [1] 109.6648 191.7987
Do the same for \(n = 100\):
n <- 100
mu_theta <- (n / 20^2 * 150 + 1/40^2 * 180) / (n/20^2 + 1/40^2)
sigma2_theta <- 1 / (n/20^2 + 1/40^2)
sigma2_yt <- 1 / (n/20^2 + 1/40^2) + 20^2
qnorm(c(0.025, 0.975), mu_theta, sqrt(sigma2_theta))
## [1] 146.1598 153.9899
qnorm(c(0.025, 0.975), mu_theta, sqrt(sigma2_yt))
## [1] 110.6805 189.4691
The posterior for \(\theta\) is much more precise; the posterior predictive distribution, however, changes little, because the uncertainty about \(\theta\) accounts for relatively little compared to the population variance \(\sigma^2 = 20^2\).
Let \(\theta\) be the proportion of Californians who support the death penalty, and suppose our prior for \(\theta\) is a Beta distribution with mean 0.6 and standard deviation 0.3.
Determine the \(\alpha\) and \(\beta\) parameters of the prior above.
\[E(\theta) = \frac{\alpha}{\alpha + \beta} = 0.6\] \[Var(\theta) = \frac{\alpha \beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)} = (0.3)^2 = 0.09\] Thus:
\[\frac{\alpha}{\alpha + \beta} = 0.6 \implies 0.4 \alpha = 0.6 \beta \implies \alpha = \frac{3}{2} \beta\]
\[\frac{\alpha \beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)} = (0.3)^2 = 0.09\] \[\implies \frac{\frac{3}{2} \beta^2}{(\frac{5}{2} \beta)^2 (\frac{5}{2} \beta + 1)} = 0.09\] \[\implies \frac{3}{2} = 0.09 \cdot \frac{25}{4} \cdot \bigg( \frac{5}{2} \beta + 1 \bigg)\] \[\implies 3 \cdot 50 = 9 \bigg( \frac{125}{8} \beta + \frac{25}{4} \bigg)\] \[\implies 150 \cdot 8 = 9 \cdot 125 \beta + 18 \cdot 25\] \[\implies 750 = 1125 \beta\] \[\implies \beta = \frac{2}{3}\] \[\implies \alpha = 1\]
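A quick check of the solution:

```r
# Beta(1, 2/3) should have mean 0.6 and standard deviation 0.3
a <- 1; b <- 2/3
beta_mean <- a / (a + b)
beta_sd <- sqrt(a * b / ((a + b)^2 * (a + b + 1)))
stopifnot(abs(beta_mean - 0.6) < 1e-12, abs(beta_sd - 0.3) < 1e-12)
```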
A sample of 1,000 Californians is taken and 65% support the death penalty. What is the posterior mean and variance for \(\theta\)? Draw the posterior density.
a <- 1 + 650
b <- 2/3 + 1000 - 650 # Beta(1, 2/3) prior derived above
theta <- seq(0, 1, 0.001)
plot(theta, dbeta(theta, a, b), type = 'l',
xlim = c(0.55, 0.75), col = 'steelblue', lwd = 2,
xlab = expression(theta), ylab = 'Density', axes = FALSE)
axis(1, tick = FALSE)
box(bty = 'L', col = 'grey60')
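The question also asks for the posterior mean and variance, which follow directly from the Beta moment formulas (using the \(\text{Beta}(1, \frac{2}{3})\) prior derived above):

```r
# Posterior is Beta(1 + 650, 2/3 + 350); report its mean and variance
a <- 1 + 650
b <- 2/3 + 1000 - 650
post_mean <- a / (a + b)
post_var <- a * b / ((a + b)^2 * (a + b + 1))
round(post_mean, 4)  # about 0.65
post_var             # about 2.3e-04
```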
Suppose there are \(N\) cable cars in San Francisco, numbered 1 to \(N\). You see a cable car at random numbered 203.
Assume a Geometric prior distribution on \(N\) with a mean of 100:
\[p(N) = \frac{1}{100} \cdot \bigg(\frac{99}{100} \bigg)^{N-1}\]
What is the posterior distribution?
\[p(N \lvert x = 203) \propto p(X=203 \lvert N) p(N)\] \[= \frac{1}{N} \cdot \frac{1}{100} \cdot \bigg(\frac{99}{100} \bigg)^{N-1} \qquad \text{(for } N\geq 203)\] \[\propto \frac{1}{N} \cdot \bigg(\frac{99}{100} \bigg)^{N-1} \qquad \text{(for } N\geq 203)\]
To be able to evaluate the posterior, we need to compute the normalizing constant \(c\) where:
\[\frac{1}{c} = \sum_{N=203}^\infty \frac{1}{N} \bigg( \frac{99}{100} \bigg)^{N-1}\]
We can compute this constant & the posterior over a finite grid of values in R:
N <- 203:1000
norm_const <- 1 / sum(1/N * (99/100)^(N - 1))
post <- norm_const * 1/N * (99/100)^(N - 1)
# Compute posterior mean
(post_mean <- sum(N * post))
## [1] 279.0202
# Compute posterior variance
(post_variance <- sum((N - post_mean)^2 * post))
## [1] 6338.603
plot(N, post, type = 'l',
col = 'steelblue', lwd = 2,
xlab = 'N', ylab = 'Density', axes = FALSE)
abline(v = post_mean, col = 'grey60', lty = 'dashed')
axis(1, tick = FALSE)
box(bty = 'L', col = 'grey60')
Choose a reasonable “non-informative” prior for \(N\) & compute posterior mean and variance.
Using \(p(N) \propto \frac{1}{N}\):
\[p(N \lvert x = 203) \propto \frac{1}{N} \cdot \frac{1}{N} = \frac{1}{N^2} \qquad \text{(for } N \geq 203)\] \[E(N \lvert x = 203) = \frac{\sum_{N=203}^\infty N \cdot \frac{1}{N^2}}{\sum_{N=203}^\infty \frac{1}{N^2}} = \frac{\sum_{N=203}^\infty \frac{1}{N}}{\sum_{N=203}^\infty \frac{1}{N^2}}\] The numerator is a tail of the harmonic series, which diverges, so the posterior mean (and hence also the posterior variance) is infinite.
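We can see numerically that the ratio \(\frac{\sum 1/N}{\sum 1/N^2}\) never settles: it keeps growing as we extend the truncation point of the sums (a quick sketch):

```r
# The posterior is proportional to 1/N^2 for N >= 203, but its mean
# involves the divergent harmonic tail sum of 1/N, so the truncated
# "posterior mean" grows without bound as the truncation point increases
post_mean <- sapply(c(1e3, 1e4, 1e5, 1e6), function(n_max) {
  N <- 203:n_max
  sum(1/N) / sum(1/N^2)
})
post_mean  # grows roughly logarithmically with the truncation point
stopifnot(all(diff(post_mean) > 0))
```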
We observe 5 observations from a Cauchy distribution with unknown center \(\theta\): \((43, 44, 45, 46.5, 47.5)\). Assume a \(\text{Uniform}(0, 100)\) prior on \(\theta\).
Compute the unnormalized posterior density function on a grid of points. Using the grid approximation, compute and plot the normalized posterior:
y <- c(43:45, 46.5, 47.5)
theta <- seq(0, 100, 0.1)
post <- sapply(theta, function(th)
prod(1 / (1 + (y - th)^2)))
post <- post / sum(post)
plot(theta, post, type = 'l',
col = 'steelblue', lwd = 2, xlim = c(40, 50),
xlab = expression(theta), ylab = 'Density', axes = FALSE)
axis(1, tick = FALSE)
box(bty = 'L', col = 'grey60')
Sample 1,000 draws of \(\theta\) from the posterior density and plot a histogram:
theta_s <- sample(theta, 1000, prob = post,
replace = TRUE)
hist(theta_s, axes = FALSE, main = NULL,
col = 'steelblue', border = 'white',
xlab = expression(theta[s]), ylab = 'Count')
axis(1, tick = FALSE)
axis(2, tick = FALSE, las = 1)
box(bty = 'L', col = 'grey60')
Use the 1,000 samples of \(\theta\) to obtain 1,000 samples from the posterior predictive distribution of a future observation \(y_6\) and plot:
set.seed(123456)
y6 <- rcauchy(1000, theta_s)
hist(y6, axes = FALSE, main = NULL, breaks = 100,
col = 'steelblue', border = 'white',
xlab = expression(y[6]), ylab = 'Count')
axis(1, tick = FALSE)
axis(2, tick = FALSE, las = 1)
box(bty = 'L', col = 'grey60')
Let \(y \lvert \theta \sim \text{Poisson}(\theta)\). Derive Jeffreys' prior for \(\theta\) & then find the \(\alpha\) & \(\beta\) parameters of the corresponding Gamma prior distribution:
\[p(y \lvert \theta) = \frac{\theta^y e^{-\theta}}{y!}\] \[log(p(y \lvert \theta)) = y \cdot log(\theta) - \theta - log(y!) \propto y \cdot log(\theta) - \theta\] \[\frac{\partial}{\partial \theta} log(p(y \lvert \theta)) = \frac{y}{\theta} - 1\] \[\frac{\partial^2}{\partial \theta^2} log(p(y \lvert \theta)) = -\frac{y}{\theta^2}\] \[J(\theta) = E \bigg(- \frac{\partial^2}{\partial \theta^2} log(p(y \lvert \theta)) \bigg) = \frac{\theta}{\theta^2} = \frac{1}{\theta}\]
\[\implies p(\theta) = \sqrt{J(\theta)} = \theta^{-\frac{1}{2}}\]
Thus:
\[\theta \sim \text{Gamma}(1/2, 0)\]
We are given the number of accidents and deaths on scheduled airline flights over a 10-year period.
Assume that fatal accidents are \(\text{Poisson}(\theta)\) distributed. Set a prior distribution, determine the posterior distribution, & give a 95% predictive interval for the number of fatal accidents in 1986.
year <- 1976:1985
accidents <- c(24, 25, 31, 31, 22, 21, 26, 20, 16, 22)
deaths <- c(734, 516, 754, 877, 814, 362, 764, 809, 223, 1066)
death_rate <- c(0.19, 0.12, 0.15, 0.16, 0.14, 0.06, 0.13, 0.13, 0.03, 0.15)
# Noninformative Gamma(0, 0) prior; the posterior is Gamma(sum(accidents), 10)
theta_s <- rgamma(1e3, sum(accidents), length(accidents))
y1986 <- rpois(1e3, theta_s)
hist(y1986, axes = FALSE, main = NULL,
col = 'steelblue', border = 'white',
xlab = 'Predicted number of accidents in 1986', ylab = 'Count')
axis(1, tick = FALSE)
axis(2, tick = FALSE, las = 1)
box(bty = 'L', col = 'grey60')
quantile(y1986, c(0.025, 0.975))
## 2.5% 97.5%
## 14 34
Assume that fatal accidents are Poisson distributed with a constant rate and exposure in each year proportional to the number of miles flown. Set a prior distribution, determine the posterior distribution, & give a 95% predictive interval for the number of fatal accidents in 1986.
miles <- deaths / death_rate # passenger miles flown, in units of 1e8
# Noninformative prior; exposure rescaled to units of 1e11 passenger miles
theta_s <- rgamma(1e3, sum(accidents),
length(accidents) * mean(miles) / 1e3)
y1986_2 <- rpois(1e3, theta_s * 8) # assuming 8e11 passenger miles flown in 1986
hist(y1986_2, axes = FALSE, main = NULL,
col = 'steelblue', border = 'white',
xlab = 'Predicted number of accidents in 1986', ylab = 'Count')
axis(1, tick = FALSE)
axis(2, tick = FALSE, las = 1)
box(bty = 'L', col = 'grey60')
quantile(y1986_2, c(0.025, 0.975))
## 2.5% 97.5%
## 22 45
Derive the posterior distribution for the conjugate Normal model (a single observation \(y\), known variance \(\sigma^2\)):
\[p(\mu \lvert y) \propto exp \bigg(-\frac{1}{2} \bigg( \frac{(y - \mu)^2}{\sigma^2} + \frac{(\mu - \mu_0)^2}{\sigma_0^2} \bigg) \bigg)\] \[= exp \bigg(-\frac{1}{2} \bigg( \frac{y^2}{\sigma^2} - \frac{2y \mu}{\sigma^2} + \frac{\mu^2}{\sigma^2} + \frac{\mu^2}{\sigma_0^2} - \frac{2 \mu \mu_0}{\sigma_0^2} + \frac{\mu_0^2}{\sigma_0^2} \bigg) \bigg)\] \[= exp \bigg(-\frac{1}{2} \bigg(\frac{\mu^2}{\sigma^2} + \frac{\mu^2}{\sigma_0^2} - \frac{2y \mu}{\sigma^2} - \frac{2 \mu \mu_0}{\sigma_0^2} + c \bigg) \bigg)\] \[\propto exp \bigg(-\frac{1}{2} \bigg(\mu^2 \bigg( \frac{1}{\sigma^2} + \frac{1}{\sigma_0^2} \bigg) - 2\mu \bigg( \frac{y }{\sigma^2} + \frac{ \mu_0}{\sigma_0^2} \bigg)\bigg) \bigg)\] \[\propto exp \Bigg(-\frac{1}{2} \cdot \bigg( \frac{1}{\sigma^2} + \frac{1}{\sigma_0^2} \bigg) \cdot \bigg(\mu - \frac{\frac{y}{\sigma^2} + \frac{ \mu_0}{\sigma_0^2}}{\frac{1}{\sigma^2} + \frac{1}{\sigma_0^2}} \bigg)^2 \Bigg) \qquad \text{(by completing the square)}\]
Thus:
\[\mu \lvert y \sim \text{Normal} \Bigg( \frac{\frac{y}{\sigma^2} + \frac{ \mu_0}{\sigma_0^2}}{\frac{1}{\sigma^2} + \frac{1}{\sigma_0^2}}, \frac{1}{\frac{1}{\sigma^2} + \frac{1}{\sigma_0^2}} \Bigg)\]
Derive the conjugate Normal model for \(n \geq 2\) observations by adding one observation at a time:
We start with \(\mu \lvert y_1 \sim \text{Normal} \Bigg( \frac{\frac{y_1}{\sigma^2} + \frac{ \mu_0}{\sigma_0^2}}{\frac{1}{\sigma^2} + \frac{1}{\sigma_0^2}}, \frac{1}{\frac{1}{\sigma^2} + \frac{1}{\sigma_0^2}} \Bigg)\), the posterior after one observation, and write \(\mu_1\), \(\sigma_1^2\) for its mean and variance. Then:
\[\frac{1}{\sigma_2^2} = \frac{1}{\sigma^2} + \frac{1}{\sigma_1^2} = \frac{1}{\sigma^2} + \bigg( \frac{1}{\sigma^2} + \frac{1}{\sigma_0^2} \bigg) = \frac{2}{\sigma^2} + \frac{1}{\sigma_0^2}\] \[\implies \sigma_2^2 = \frac{1}{\frac{2}{\sigma^2} + \frac{1}{\sigma_0^2}}\]
\[\mu_2 = \frac{\frac{y_2}{\sigma^2} + \frac{\mu_1}{\sigma_1^2}}{\frac{1}{\sigma_2^2}}\] \[= \frac{\frac{y_2}{\sigma^2} + \bigg( \frac{\frac{y_1}{\sigma^2} + \frac{\mu_0}{\sigma_0^2}}{\frac{1}{\sigma^2} + \frac{1}{\sigma_0^2}} \bigg) \cdot \bigg( \frac{1}{\sigma^2} + \frac{1}{\sigma_0^2} \bigg)}{\frac{1}{\sigma_2^2}}\] \[= \frac{\frac{y_2}{\sigma^2} + \frac{y_1}{\sigma^2} + \frac{\mu_0}{\sigma_0^2} }{\frac{1}{\sigma_2^2}}\] \[= \frac{\frac{2 \bar y}{\sigma^2} + \frac{\mu_0}{\sigma_0^2}}{\frac{2}{\sigma^2} + \frac{1}{\sigma_0^2}}\]
where \(\bar y = \frac{y_1 + y_2}{2}\). It is easy to see that this updating formula will work for \(n = 2, 3, \ldots\).
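A numeric check that the sequential update matches the batch formula for \(n = 2\) (the prior and data values below are arbitrary):

```r
# Update one observation at a time and compare with the two-observation
# batch posterior
mu0 <- 180; s2_0 <- 40^2; s2 <- 20^2
y <- c(150, 160)
mu1  <- (y[1]/s2 + mu0/s2_0) / (1/s2 + 1/s2_0)   # after y1
s2_1 <- 1 / (1/s2 + 1/s2_0)
mu2  <- (y[2]/s2 + mu1/s2_1) / (1/s2 + 1/s2_1)   # after y2
s2_2 <- 1 / (1/s2 + 1/s2_1)
mu_b <- (sum(y)/s2 + mu0/s2_0) / (2/s2 + 1/s2_0) # batch update
s2_b <- 1 / (2/s2 + 1/s2_0)
stopifnot(abs(mu2 - mu_b) < 1e-10, abs(s2_2 - s2_b) < 1e-10)
```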
We have the following result (Beta function):
\[\int_0^1 u^{\alpha - 1} (1 - u)^{\beta - 1} du = \frac{\Gamma(\alpha) \Gamma (\beta)}{\Gamma(\alpha + \beta)}\]
If \(Z\) has a Beta distribution, find \(E[Z^m (1 - Z)^n]\), and hence find the mean and variance of the Beta distribution:
\[E[Z^m (1 - Z)^n] = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha) \Gamma(\beta)} \int_0^1 z^{\alpha + m -1} (1 - z)^{\beta + n - 1} dz\] \[= \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha) \Gamma(\beta)} \frac{\Gamma(\alpha + m) \Gamma(\beta + n)}{\Gamma(\alpha + \beta + m + n)}\]
If we want the expectation of \(Z\), \(E(Z)\), then clearly \(m=1\) and \(n = 0\):
\[E(Z) = E[Z^1(1 - Z)^0] = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha) \Gamma(\beta)} \frac{\Gamma(\alpha + 1) \Gamma(\beta + 0)}{\Gamma(\alpha + \beta + 1 + 0)}\] \[= \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha) \Gamma(\beta)} \frac{\alpha \cdot\Gamma(\alpha) \Gamma(\beta + 0)}{(\alpha + \beta) \cdot \Gamma(\alpha + \beta)}\] \[= \frac{\alpha}{\alpha + \beta}\] We can do the same for variance of \(Z\):
\[Var(Z) = E(Z^2) - E(Z)^2 = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha) \Gamma(\beta)} \frac{\Gamma(\alpha + 2) \Gamma(\beta + 0)}{\Gamma(\alpha + \beta + 2 + 0)} - \bigg( \frac{\alpha}{\alpha + \beta} \bigg)^2\] \[= \frac{(\alpha + 1)\alpha}{(\alpha + \beta + 1) (\alpha + \beta)} - \frac{\alpha^2}{(\alpha + \beta)^2}\] \[= \frac{(\alpha^2 + \alpha)(\alpha + \beta) - \alpha^2 (\alpha + \beta + 1)}{(\alpha + \beta)^2 (\alpha + \beta + 1)}\] \[= \frac{\alpha^3 + \alpha^2 \beta + \alpha^2 + \alpha \beta - \alpha^3 - \alpha^2 \beta - \alpha^2}{(\alpha + \beta)^2 (\alpha + \beta + 1)}\] \[= \frac{\alpha \beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)}\]
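The general moment formula can be checked against numerical integration (the values of \(\alpha\), \(\beta\), \(m\), \(n\) below are arbitrary):

```r
# E[Z^m (1 - Z)^n] for Z ~ Beta(a, b): analytic formula vs numerical integral
a <- 2.5; b <- 4; m <- 1; n <- 2
analytic <- gamma(a + b) / (gamma(a) * gamma(b)) *
  gamma(a + m) * gamma(b + n) / gamma(a + b + m + n)
numerical <- integrate(function(z) z^m * (1 - z)^n * dbeta(z, a, b), 0, 1)$value
stopifnot(abs(analytic - numerical) < 1e-8)
```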
Let \(Y\) have a \(\text{Binomial}(n, \theta)\) distribution and \(\theta\) have a \(\text{Beta}(\alpha, \beta)\) prior.
Find \(p(y)\) i.e. the marginal distribution of \(y\):
\[p(y) = \int_{\theta} p(y \lvert \theta) p(\theta) d \theta\] \[= \int_0^1 {n \choose y} \theta^y (1 - \theta)^{n - y} \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha) \Gamma(\beta)} \theta^{\alpha - 1} (1 - \theta)^{\beta - 1} d \theta\] \[= {n \choose y} \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha) \Gamma(\beta)} \int_0^1 \theta^{\alpha + y - 1} (1 - \theta)^{\beta + n-y -1} d \theta\] \[= {n \choose y} \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha) \Gamma(\beta)} \frac{\Gamma(\alpha + y) \Gamma(\beta + n - y)}{\Gamma(\alpha + \beta + n)}\]
(i.e. the Beta-Binomial distribution)
Show that, if the beta-binomial probability is constant in \(y\), then \(\alpha = \beta = 1\):
\[{n \choose y} \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha) \Gamma(\beta)} \frac{\Gamma(\alpha + y) \Gamma(\beta + n - y)}{\Gamma(\alpha + \beta + n)} \propto \frac{\Gamma(\alpha + y) \Gamma(\beta + n - y)}{y!(n-y)!} \qquad \text{(all terms not involving } y \text{ are constant)}\] \[= \frac{(\alpha + y - 1)! (\beta + n - y - 1)!}{y! (n - y)!}\]
When \(\alpha = \beta = 1\), the factorial terms cancel exactly, so the probability is constant in \(y\).
Let \(\frac{nv}{\sigma^2}\) be distributed as \(\chi^2_n\), and let \(\sigma\) have an improper prior distribution with density given by \(p(\sigma) \propto \frac{1}{\sigma}\)
Prove that the corresponding prior density for \(\sigma^2\) is \(p(\sigma^2) \propto \frac{1}{\sigma^2}\).
Let \(x = \sigma\) and \(y = \sigma^2\).
\[y = y(x) = x^2\] \[x = x(y) = \sqrt{y}\] \[\frac{\partial x}{\partial y} = \frac{1}{2 \sqrt{y}}\]
\[\implies p_Y(y) = p_X(x(y)) \cdot \bigg\lvert \frac{\partial x}{\partial y} \bigg\lvert = \frac{1}{\sqrt{y}} \cdot \frac{1}{2 \sqrt{y}}\] \[= \frac{1}{\sqrt{\sigma^2}} \cdot \frac{1}{2 \sqrt{\sigma^2}} \propto \frac{1}{\sigma^2} \qquad \text{(substituting } y = \sigma^2 \text{ back in)}\]
Show that the 95% highest posterior density region for \(\sigma^2\) is not the same as that obtained by squaring the endpoints of the posterior interval for \(\sigma\).
\[p(\sigma \lvert \text{data}) \propto p(\text{data} \lvert \sigma)p(\sigma) \propto \bigg( \frac{nv}{\sigma^2} \bigg)^{\frac{n}{2} - 1} exp \bigg(-\frac{nv}{2\sigma^2} \bigg) \cdot \frac{n}{\sigma^2} \cdot \frac{1}{\sigma} \propto (\sigma^2)^{-\frac{n}{2} - \frac{1}{2}} exp \bigg(-\frac{c}{\sigma^2} \bigg)\] \[p(\sigma^2 \lvert \text{data}) \propto p(\text{data} \lvert \sigma^2)p(\sigma^2) \propto \bigg( \frac{nv}{\sigma^2} \bigg)^{\frac{n}{2} - 1} exp \bigg(-\frac{nv}{2\sigma^2} \bigg) \cdot \frac{n}{\sigma^2} \cdot \frac{1}{\sigma^2} \propto (\sigma^2)^{-\frac{n}{2} - 1} exp \bigg(-\frac{c}{\sigma^2} \bigg)\] \[(\text{where } c = \frac{nv}{2} \text{ and the factor } \frac{n}{\sigma^2} \text{ is the Jacobian of transforming the } \chi^2_n \text{ density of } \frac{nv}{\sigma^2} \text{ into a density for } v)\]
Let \((\sqrt{a}, \sqrt{b})\) and \((a, b)\) be the 95% highest posterior density intervals for \(\sigma\) and \(\sigma^2\) respectively. Then:
\[a^{-\frac{n}{2} - \frac{1}{2}} exp\bigg(-\frac{c}{a} \bigg) = b^{-\frac{n}{2} - \frac{1}{2}} exp\bigg(-\frac{c}{b} \bigg)\] \[a^{-\frac{n}{2} - 1} exp\bigg(-\frac{c}{a} \bigg) = b^{-\frac{n}{2} - 1} exp\bigg(-\frac{c}{b} \bigg)\]
Or equivalently:
\[\bigg(-\frac{n}{2} - \frac{1}{2} \bigg) log(a) - \frac{c}{a} = \bigg(-\frac{n}{2} - \frac{1}{2} \bigg) log(b) - \frac{c}{b}\] \[\bigg(-\frac{n}{2} - 1 \bigg) log(a) - \frac{c}{a} = \bigg(-\frac{n}{2} - 1 \bigg) log(b) - \frac{c}{b}\] Subtracting the second equation from the first: \[\implies \frac{1}{2} log(a) = \frac{1}{2} log(b)\] This holds only if \(a = b\), but then the 95% interval would be a single point containing zero probability mass, which is not a valid interval; we have a contradiction.
Let \(y_i\) be distributed as \(\text{Poisson}(x_i \lambda)\), where \(\lambda\) is the rate parameter and \(x_i\) is an exposure variable, and let \(\lambda\) have a \(\text{Gamma}(\alpha, \beta)\) prior distribution. Then:
\[p(\lambda \lvert y) \propto p(y \lvert \lambda)p(\lambda) = \prod_{i=1}^n \frac{(x_i \lambda)^{y_i} e^{-x_i \lambda}}{y_i!} \cdot \frac{\beta^{\alpha}}{\Gamma(\alpha)} \lambda^{\alpha - 1} e^{- \beta \lambda}\] \[\propto \lambda^{ \alpha + \sum_{i = 1}^n y_i - 1} e^{- (\beta + \sum_{i=1}^n x_i) \lambda}\] \[\implies \lambda \lvert y \sim \text{Gamma} \bigg(\alpha + \sum_{i=1}^n y_i, \beta + \sum_{i=1}^n x_i \bigg)\]
Show that if \(y\) is \(\text{Exponential}(\theta)\) distributed, then the conjugate prior for \(\theta\) is Gamma:
\[p(y \lvert \theta) = \prod_{i= 1}^n \theta e^{-\theta y_i} = \theta^n e^{-\theta\sum_{i = 1}^n y_i}\] \[p(\theta) = \frac{\beta^\alpha}{\Gamma(\alpha)} \theta^{\alpha - 1} e^{-\beta \theta}\]
\[\implies p(\theta \lvert y) \propto \theta^{\alpha + n - 1}e^{-(\beta + \sum_{i=1}^n y_i) \theta}\] \[\implies \theta \lvert y \sim \text{Gamma}(\alpha + n, \beta + \sum_{i=1}^n y_i)\]
Show that the equivalent prior specification for the mean, \(\phi = \frac{1}{\theta}\), is Inverse-Gamma.
\[\phi = \phi(\theta) = \frac{1}{\theta}\] \[\theta = \theta(\phi) = \frac{1}{\phi}\] \[\frac{\partial \theta}{\partial \phi} = - \frac{1}{\phi^2}\]
\[p_\phi (\phi) = p_\theta(\theta (\phi)) \cdot \bigg\lvert \frac{\partial \theta}{\partial \phi} \bigg\lvert \propto \bigg( \frac{1}{\phi} \bigg)^{\alpha - 1} e^{-\frac{\beta}{\phi}} \cdot \frac{1}{\phi^2} = \phi^{-\alpha - 1} e^{-\frac{\beta}{\phi}}\] \[\implies \phi \sim \text{Inverse-Gamma}(\alpha, \beta)\]

### 2.19 (c)
The lifetime of a lightbulb is distributed \(\text{Exponential}(\theta)\). Suppose \(\theta\) has a Gamma prior distribution with coefficient of variation (standard deviation divided by mean) of 0.5. If the coefficient of variation is to be reduced to 0.1, how many lightbulbs must be tested?
\[\frac{\sqrt{\frac{\alpha}{\beta^2}}}{\frac{\alpha}{\beta}} = 0.5\] \[\implies \frac{1}{\sqrt{\alpha}} = 0.5\] \[\implies \frac{1}{\alpha} = 0.25 \implies \alpha = 4\]
The posterior distribution for \(\theta\) will be \(\text{Gamma}(\alpha + n , \beta + \sum_{i=1}^n y_i)\), therefore:
\[\frac{1}{\sqrt{\alpha + n}} = 0.1\] \[\implies \frac{1}{\sqrt{4 + n}} = 0.1\] \[\implies \frac{1}{n + 4} = 0.01\] \[\implies 100 = n + 4 \implies n = 96\]
Thus we need to test at least 96 lightbulbs.
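Checking the coefficient-of-variation arithmetic:

```r
# CV of Gamma(alpha, beta) is 1/sqrt(alpha), regardless of beta;
# n exponential observations raise alpha to alpha + n
alpha <- 4
stopifnot(abs(1 / sqrt(alpha) - 0.5) < 1e-12)       # prior CV
stopifnot(abs(1 / sqrt(alpha + 96) - 0.1) < 1e-12)  # posterior CV with n = 96
```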
Suppose \(y\) is \(\text{Exponential}(\theta)\) distributed and the prior distribution for \(\theta\) is \(\text{Gamma}(\alpha, \beta)\). We observe \(y \geq 100\) but don’t know the exact value of \(y\). What is the posterior distribution for \(\theta \lvert y \geq 100\) as a function of \(\alpha\) and \(\beta\)? What’s the posterior mean and variance?
\[p(y \geq 100 \lvert \theta) = 1 - (1 - e^{-100 \theta}) = e^{-100 \theta}\] \[p(\theta) \propto \theta^{\alpha - 1} e^{-\beta \theta}\] \[\implies p(\theta \lvert y \geq 100) \propto \theta^{\alpha - 1} e^{- (\beta + 100) \theta}\] \[\implies \theta \lvert y \geq 100 \sim \text{Gamma}(\alpha, \beta + 100)\]
Thus:
\[E(\theta \lvert y) = \frac{\alpha}{\beta + 100}\] \[Var(\theta \lvert y) = \frac{\alpha}{(\beta + 100)^2}\]
What is the posterior distribution, the posterior mean and variance if we’re told \(y=100\)?
From question 2.19 (a), we know that \(\theta\) will have a \(\text{Gamma}(\alpha + 1, \beta + 100)\) distribution. Therefore:
\[E(\theta \lvert y) = \frac{\alpha + 1}{\beta + 100}\] \[Var(\theta \lvert y) = \frac{\alpha + 1}{(\beta + 100)^2}\]
Explain why the posterior variance of \(\theta \lvert y = 100\) is greater than that of \(\theta \lvert y \geq 100\).
On average, the variance of \(\theta\) decreases with more information:
\[E(Var(\theta \lvert y) \lvert y \geq 100) \leq Var(\theta \lvert y \geq 100)\] \(y = 100\) is just one possible value, so the variance of its corresponding posterior is not necessarily lower than that of \(\theta \lvert y \geq 100\); only averaging across all possible values \(y \geq 100\) is guaranteed to result in lower variance.
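A simulation illustrates the inequality. The prior values below (\(\alpha = 2\), \(\beta = 100\)) are arbitrary, chosen only so that the event \(y \geq 100\) has appreciable probability:

```r
set.seed(2)
a <- 2; b <- 100
theta <- rgamma(1e6, a, b)
y <- rexp(1e6, theta)
keep <- y >= 100                          # condition on the censored event
var_interval <- a / (b + 100)^2           # Var(theta | y >= 100), from above
var_exact <- (a + 1) / (b + y[keep])^2    # Var(theta | y) at each exact y
mean(var_exact) < var_interval            # lower on average ...
any(var_exact > var_interval)             # ... but not for every individual y
```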
The file pew_research_center_june_elect_wknd_data.dta has data from Pew Research Center polls taken during the 2008 election campaign.
Graph the proportion of liberals in each state vs. Obama vote share
dt1 <- haven::read_dta('pew_research_center_june_elect_wknd_data.dta')
dt1$state <- haven::as_factor(dt1$state)
dt1 <- subset(dt1, state != 'hawaii') # Drop Hawaii
dt2 <- read.csv('2008ElectionResult.csv')
dt2 <- subset(dt2, !(state %in% c('Alaska', 'Hawaii'))) # Drop Alaska & Hawaii
dt2$state <- tolower(dt2$state)
dt2[dt2$state == 'district of columbia', 1] <- 'washington dc'
# Calculate percentage of "Very liberal" voters
pct_lib <- tapply(dt1$ideo, dt1$state, function(x) mean(x == 5, na.rm = TRUE))
pct_lib <- pct_lib[!is.na(pct_lib)]
dt3 <- data.frame(state = names(pct_lib), pct_lib)
dt3 <- merge(dt3, dt2, all.x = TRUE)
dt3 <- dt3[order(dt3$state), ]
plot(dt3$pct_lib, dt3$vote_Obama_pct / 100,
axes = FALSE, type = 'n',
xlab = 'Proportion of "very liberal" voters',
ylab = "Obama's vote share", xlim = c(0, 0.12))
text(dt3$pct_lib, dt3$vote_Obama_pct / 100,
label = dt3$state)
axis(1, tick = FALSE)
axis(2, tick = FALSE)
box(bty = 'L', col = 'grey60')
A hypothetical study is performed in which a random sample of 100 college students is first invited to take 100 free throws to establish their baseline ability. The students then take 50 practice shots each day for one month and are invited back for a final measurement of 100 shots. Give three prior distributions for \(\theta\), the improvement in success probability:
A non-informative prior.
Since \(\theta\) is constrained to lie between -1 and 1 (the extremes correspond to a student making all 100 free throws in one session and none in the other), a natural choice for a non-informative prior is \(\text{Uniform}(-1, 1)\).
A subjective prior based on your best knowledge.
We can assume that most students will probably improve with a month of practice, although it’s questionable by how much. From my (poor) knowledge of basketball, I’d assume the students to make on average about 10/100 free throws on the first go and about 30/100 throws on the second go. Under that assumption, a quick-and-dirty but reasonable prior might be \(\text{Normal}(0.2, 0.2)\), i.e. one centered on the expected difference but with a fairly wide margin for a better or worse overall trend, and with a very low prior probability of \(\theta\) exceeding 1 or -1 (with more effort, we could construct a prior that’s actually bounded between those values, such as a scaled Beta, but that should not be necessary with 100 participants).
A weakly informative prior
A good quick-and-dirty weakly informative prior might be \(\text{Normal}(0, 0.5)\), which spreads its probability mass fairly diffusely across the range of reasonable \(\theta\)’s. Again, this prior allows values of \(\theta\) smaller than -1 and greater than 1; however, that should not be a problem given the amount of data.
Suppose data \((y_1, y_2, \ldots , y_k)\) comes from a \(\text{Multinomial}(\theta_1, \theta_2, \ldots, \theta_k)\) distribution, and we have a \(\text{Dirichlet}(\alpha_1, \alpha_2, \ldots, \alpha_k)\) prior on the \(\theta\)’s. Let \(\alpha = \frac{\theta_1}{\theta_1 + \theta_2}\).
Write the marginal posterior distribution of \(\alpha\):
\[p(\theta_1, \theta_2 \lvert \mathbf{y}) \propto \theta_1^{\alpha_1 + y_1 -1} \theta_2^{\alpha_2 + y_2 - 1} (1 - \theta_1 - \theta_2)^{\sum_{i \not\in \{1, 2 \}} (\alpha_i + y_i) - 1}\]
\[\alpha = \alpha(\theta_1, \theta_2) = \frac{\theta_1}{\theta_1 + \theta_2}; \qquad \beta = \beta(\theta_1, \theta_2) = \theta_1 + \theta_2\] \[\theta_1 = \theta_1(\alpha, \beta) = \alpha \beta; \qquad \theta_2 = \theta_2(\alpha, \beta) = \beta - \alpha \beta = \beta (1 - \alpha)\] \[\lvert D \lvert = \Bigg\lvert \det \Bigg( \begin{bmatrix} \beta & \alpha \\ -\beta & 1 - \alpha \end{bmatrix} \Bigg) \Bigg\lvert = \lvert \beta (1 - \alpha) - \alpha (- \beta) \lvert = \beta\] \[\implies p(\alpha, \beta \lvert y) = p_{\theta_1, \theta_2} (\theta_1 (\alpha, \beta), \theta_2(\alpha, \beta)) \cdot \lvert D \lvert\] \[\propto (\alpha \beta)^{\alpha_1 + y_1 - 1} [\beta (1 - \alpha)]^{\alpha_2 + y_2 - 1} [1 - \alpha \beta - \beta(1 - \alpha)]^{\sum_{i \not\in \{1, 2 \}} (\alpha_i + y_i) - 1} \cdot \beta\] \[= \alpha^{\alpha_1 + y_1 - 1} (1 - \alpha)^{\alpha_2 + y_2 - 1} \beta^{\alpha_1 + \alpha_2 + y_1 + y_2 - 1} (1 - \beta)^{\sum_{i \not\in \{1, 2 \}} (\alpha_i + y_i) - 1}\]
Thus:
\[p(\alpha \lvert y) \propto \int_{\beta} p(\alpha, \beta \lvert y) d\beta\] \[= \alpha^{\alpha_1 + y_1 - 1} (1 - \alpha)^{\alpha_2 + y_2 - 1} \int_0^1 \beta^{\alpha_1 + \alpha_2 + y_1 + y_2 - 1} (1 - \beta)^{\sum_{i \not\in \{1, 2 \}} (\alpha_i + y_i) - 1} d\beta\] \[= \alpha^{\alpha_1 + y_1 - 1} (1 - \alpha)^{\alpha_2 + y_2 - 1} \cdot c\] \[\implies \alpha \lvert y \sim \text{Beta}(\alpha_1 + y_1, \alpha_2 + y_2)\]
Show that this distribution is identical to treating \(y_1\) as an observation from \(\text{Binomial}(y_1 + y_2, \alpha)\):
\[p(\alpha) \propto \alpha^{\alpha_1 - 1} (1 - \alpha)^{\alpha_2 - 1}\] \[p(y_1 \lvert \alpha) \propto \alpha^{y_1}(1 - \alpha)^{y_2}\] \[\implies p(\alpha \lvert y_1) \propto \alpha^{\alpha_1 + y_1 - 1}(1 - \alpha)^{\alpha_2 + y_2 - 1}\]
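The equivalence can also be checked by simulation: draws of \(\alpha = \theta_1 / (\theta_1 + \theta_2)\) from the Dirichlet posterior should match draws from \(\text{Beta}(\alpha_1 + y_1, \alpha_2 + y_2)\). The counts and prior below are arbitrary values chosen for the check:

```r
set.seed(3)
y <- c(10, 5, 20)   # arbitrary multinomial counts
a0 <- c(1, 1, 1)    # Dirichlet prior parameters
# Dirichlet posterior draws via independent Gammas: theta_j = g_j / sum(g)
g <- sapply(a0 + y, function(s) rgamma(1e5, s, 1))
theta <- g / rowSums(g)
alpha_dirichlet <- theta[, 1] / (theta[, 1] + theta[, 2])
alpha_beta <- rbeta(1e5, a0[1] + y[1], a0[2] + y[2])
# The two sets of draws should have (nearly) identical quantiles
rbind(quantile(alpha_dirichlet, c(0.1, 0.5, 0.9)),
      quantile(alpha_beta, c(0.1, 0.5, 0.9)))
```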
In a pre-debate poll for the 1988 presidential debate, 294 respondents showed a preference for George H. W. Bush and 307 showed a preference for Michael Dukakis. In the post-debate poll, 288 showed a preference for Bush and 332 for Dukakis.
For \(\alpha_j\), \(j=1, 2\), let \(\alpha_j\) be the proportion of voters who preferred Bush, out of those who had preference for either Bush or Dukakis at poll time \(j\). Plot histogram of the posterior density of \(\alpha_2 - \alpha_1\). What is the posterior probability that there was a shift towards Bush?
a1 <- rbeta(1e4, 295, 308)
a2 <- rbeta(1e4, 289, 333)
hist(a2 - a1, axes = FALSE, main = NULL,
col = 'steelblue', border = 'white',
xlab = expression(alpha[2] - alpha[1]), ylab = 'Count')
axis(1, tick = FALSE)
axis(2, tick = FALSE, las = 1)
box(bty = 'L', col = 'grey60')
There’s only around 0.19 probability that the post-debate opinion shifted towards Bush.
Measurements of the effect of magnetic fields on the calcium flow in chickens’ brains were taken for two groups of chickens. In the control group, the mean of 32 measurements was 1.013 with a standard deviation of 0.24; in the treatment group, the mean of 36 measurements was 1.173 with a standard deviation of 0.20.
Assuming a uniform prior on \((\mu_c, \mu_t, log(\sigma_c), log(\sigma_t))\) and that the samples were taken from a normal distribution, what are the posterior distributions of \(\mu_c\) and \(\mu_t\)?
\[\mu_c \lvert \mathbf{y} \sim t_{31}(1.013, 0.24^2 / 32)\] \[\mu_t \lvert \mathbf{y} \sim t_{35}(1.173, 0.20^2 / 36)\]
What is the posterior distribution for \(\mu_t - \mu_c\)? Plot histogram & give approximate 95% credible interval.
mu_c <- (0.24 / sqrt(32)) * rt(1e4, 31) + 1.013
mu_t <- (0.2 / sqrt(36)) * rt(1e4, 35) + 1.173
mu_diff <- mu_t - mu_c
mu_diff_ci <- quantile(mu_diff, c(0.025, 0.975))
hist(mu_diff, axes = FALSE, main = NULL,
breaks = 20, col = 'steelblue', border = 'white',
xlab = expression(mu[t] - mu[c]), ylab = 'Count')
abline(v = mu_diff_ci, lty = 'dashed', col = 'grey60')
axis(1, tick = FALSE)
axis(2, tick = FALSE, las = 1)
box(bty = 'L', col = 'grey60')
Patients received either treatment or control. Out of 674 patients receiving the control, 39 died; out of the 680 patients receiving the treatment, 22 died. Assume that the outcomes are Binomially distributed, with probabilities \(p_0\) and \(p_1\) for the control and treatment groups respectively.
Set up a non-informative prior distribution on \((p_0, p_1)\) and obtain posterior samples.
# Sample thetas from independent Beta distributions with
# non-informative Beta(1, 1) priors: Beta(1 + deaths, 1 + survivors)
theta0_1 <- rbeta(1e4, 1 + 39, 1 + 674 - 39)
theta1_1 <- rbeta(1e4, 1 + 22, 1 + 680 - 22)
or1 <- (theta1_1 / (1 - theta1_1)) / (theta0_1 / (1 - theta0_1))
# Sample thetas from independent Beta distributions
# with Jeffreys' Beta(1/2, 1/2) priors
theta0_2 <- rbeta(1e4, 0.5 + 39, 0.5 + 674 - 39)
theta1_2 <- rbeta(1e4, 0.5 + 22, 0.5 + 680 - 22)
or2 <- (theta1_2 / (1 - theta1_2)) / (theta0_2 / (1 - theta0_2))
# colt() is assumed to be a small helper for transparent colours, e.g.:
colt <- function(col, alpha) adjustcolor(col, alpha.f = alpha)
hist(or1, axes = FALSE, main = NULL,
breaks = seq(0, 2, 0.05),
col = colt('steelblue', 0.5), border = 'white',
xlab = 'Odds ratio', ylab = 'Count')
hist(or2, axes = FALSE, main = NULL,
breaks = seq(0, 2, 0.05),
col = colt('indianred', 0.25), border = 'white',
xlab = 'Odds ratio', ylab = 'Count', add = TRUE)
axis(1, tick = FALSE)
axis(2, tick = FALSE, las = 1)
box(bty = 'L', col = 'grey60')
quantile(or1, c(0.025, 0.975))
## 2.5% 97.5%
## 0.3257389 0.9489914
quantile(or2, c(0.025, 0.975))
## 2.5% 97.5%
## 0.3240375 0.9460091
Discuss sensitivity of your inference to your choice of non-informative prior.
The inference doesn’t seem to be strongly affected by the choice between the “flat” non-informative \(\text{Beta}(1, 1)\) prior and the Jeffreys \(\text{Beta}(1/2, 1/2)\) prior.
Let \(y = (10, 10, 12, 11, 9)\) be five measurements of an object’s weight, rounded to the nearest pound. Assume that the unrounded measurements are normally distributed with mean \(\mu\) and variance \(\sigma^2\).
Give the posterior distribution obtained by pretending the observations are exact, unrounded measurements.
y <- c(10, 10, 12, 11, 9)
(post_mean <- mean(y))
## [1] 10.4
(s2 <- var(y))
## [1] 1.3
(post_var <- s2 / 5)
## [1] 0.26
Assuming a uniform prior on \((\mu, log(\sigma))\), the posterior for \((\mu, \sigma^2)\) will be:
\[p(\mu, \sigma^2 \lvert y) \propto \sigma^{-n - 2} exp \bigg( -\frac{1}{2\sigma^2} [(n-1)s^2 + n(\bar y -\mu)^2 ] \bigg)\]
The marginal posterior for \(\mu\) will be:
\[\mu \lvert y \sim t_{4} \bigg(10.4, \frac{1.3}{5} \bigg)\] The marginal posterior for \(\sigma^2\) will be:
\[\sigma^2 \lvert y \sim \text{Inverse-} \chi^2 (4, 1.3)\]
# Compute posterior density two different ways:
# 1) Rescale y by post. mean and variance & feed it into dt()
# 2) Draw t(0, 1, 4) samples & rescale by post. mean and variance
mu <- seq(4, 16, 0.01)
post_mu <- dt((mu - 10.4) / sqrt(post_var), 4)
post_mu <- post_mu / sum(post_mu)
mu_s <- sqrt(post_var) * rt(1e4, 4) + 10.4
hi <- hist(mu_s, axes = FALSE, main = NULL, freq = FALSE,
breaks = 30, col = 'steelblue', border = 'white',
xlab = expression(mu), ylab = 'Density')
lines(mu, post_mu * (max(hi$density) / max(post_mu)),
type = 'l', col = 'indianred', lwd = 2)
axis(1, tick = FALSE)
box(bty = 'L', col = 'grey60')
legend('topright', lwd = 4,
col = c('steelblue', 'indianred'),
legend = c('Posterior draws', 'Posterior density'))
# The posterior draws, the grid-computed analytic density,
# and a frequentist t-test give roughly the same answers
mu[sapply(c(0.025, 0.975), function(x)
min(which(cumsum(post_mu) > x)))]
## [1] 8.99 11.81
quantile(mu_s, c(0.025, 0.975))
## 2.5% 97.5%
## 8.982174 11.821722
t.test(y)
##
## One Sample t-test
##
## data: y
## t = 20.396, df = 4, p-value = 3.412e-05
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 8.984285 11.815715
## sample estimates:
## mean of x
## 10.4
# For variance, I'll just plot the histogram of posterior samples
# (could do the same as above though)
sigma2_s <- (4 * s2) / rchisq(1e3, 4) # Sample from Inverse-Chisquare
hist(sigma2_s, axes = FALSE, main = NULL, freq = FALSE,
breaks = 50, col = 'steelblue', border = 'white',
xlab = expression(sigma^2), ylab = 'Density')
axis(1, tick = FALSE)
box(bty = 'L', col = 'grey60')
Give the correct posterior distribution for \((\mu, \sigma^2)\) treating the measurements as rounded.
Again, using a noninformative prior density that is uniform on \((\mu, log(\sigma))\):
\[p(\mu, \sigma^2 \lvert y) \propto \frac{1}{\sigma^2} \prod_{i=1}^n \bigg( \Phi \bigg( \frac{y_i + 0.5 - \mu}{\sigma} \bigg) - \Phi \bigg( \frac{y_i - 0.5 - \mu}{\sigma} \bigg) \bigg)\]
How do the correct and incorrect posterior distributions differ?
mu <- seq(10.4 - 3, 10.4 + 3, length = 100)
log_sigma <- seq(-1, 2, length = 100)
llfun1 <- function(a, b) {
ll <- 0
for (i in 1:5) ll <- ll + dnorm(y[i], a, b, log = TRUE)
return(ll)
}
llfun2 <- function(a, b) {
ll <- 0
for (i in 1:5) ll <- ll + log(pnorm(y[i] + 0.5, a, b) -
pnorm(y[i] - 0.5, a, b))
return(ll)
}
post1 <- outer(mu, exp(log_sigma), llfun1)
post2 <- outer(mu, exp(log_sigma), llfun2)
post1 <- exp(post1 - max(post1))
post2 <- exp(post2 - max(post2))
par(mfrow = c(1, 2))
contour(mu, log_sigma, post1, drawlabels = FALSE, axes = FALSE,
xlab = expression(mu), ylab = expression(log(sigma)),
main = 'Unrounded', col = 'steelblue')
box(bty = 'L', col = 'grey60')
axis(1, tick = FALSE)
axis(2, tick = FALSE, las = 1)
contour(mu, log_sigma, post2, drawlabels = FALSE, axes = FALSE,
xlab = expression(mu), ylab = expression(log(sigma)),
main = 'With rounding', col = 'steelblue')
box(bty = 'L', col = 'grey60')
axis(1, tick = FALSE)
axis(2, tick = FALSE, las = 1)
The two posteriors look largely similar. The posterior that accounts for rounding has a slightly wider range for \(log(\sigma)\), suggesting that correctly accounting for rounding introduces a little more uncertainty into our estimate of the variance of \(y\).
Let \(z = (z_1, z_2, z_3, z_4, z_5)\) be the original unrounded measurements. Draw simulations from the posterior distribution and compute the posterior mean of \((z_2 - z_1)^2\).
I decided to tackle the problem using rejection sampling:
# First draw proposal samples from the joint posterior that ignores rounding
n_s <- 1e4
sigma2_s <- (4 * s2) / rchisq(n_s, 4)
mu_s <- rnorm(n_s, post_mean, sqrt(sigma2_s / 5))
# Accept each draw with probability proportional to the ratio of the
# rounded-data likelihood to the unrounded likelihood
lr <- sapply(seq_len(n_s), function(i) {
s <- sqrt(sigma2_s[i])
sum(log(pnorm(y + 0.5, mu_s[i], s) - pnorm(y - 0.5, mu_s[i], s))) -
sum(dnorm(y, mu_s[i], s, log = TRUE))
})
ratio <- exp(lr - max(lr))
keep <- runif(n_s) < ratio
mu_s2 <- mu_s[keep]
sigma2_s2 <- sigma2_s[keep]
hist(mu_s2)
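Given the accepted draws, each unrounded measurement \(z_i\) is conditionally normal, truncated to \((y_i - 0.5, y_i + 0.5)\). A sketch of the final computation (it assumes `mu_s2` and `sigma2_s2` hold the accepted posterior draws from above, and falls back to draws from the unrounded posterior if they are missing):

```r
if (!exists("mu_s2")) {
  # Fallback so the sketch is self-contained: draws from the posterior
  # that ignores rounding, using the sample mean 10.4 and variance 1.3
  sigma2_s2 <- (4 * 1.3) / rchisq(2e3, 4)
  mu_s2 <- rnorm(2e3, 10.4, sqrt(sigma2_s2 / 5))
}
# Inverse-CDF sampler for a normal truncated to (y_i - 0.5, y_i + 0.5)
rtrunc <- function(y_i, mu, sigma) {
  lo <- pnorm(y_i - 0.5, mu, sigma)
  hi <- pnorm(y_i + 0.5, mu, sigma)
  qnorm(lo + runif(length(mu)) * (hi - lo), mu, sigma)
}
s <- sqrt(sigma2_s2)
z1 <- rtrunc(10, mu_s2, s)  # the first rounded observation was 10
z2 <- rtrunc(10, mu_s2, s)  # so was the second
mean((z2 - z1)^2)           # posterior mean of (z2 - z1)^2
```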
We observe data \(y_i\) coming from a \(\text{Binomial}(N, \theta)\) distribution with unknown probability \(\theta\) and unknown sample size \(N\). Raftery (1988) modelled \(N\) as coming from a \(\text{Poisson}(\mu)\) distribution, with a prior on \(\lambda = \mu \theta\).
Let the prior distribution be \(p(\lambda, \theta) = \frac{1}{\lambda}\). What is the motivation for this non-informative prior distribution? Is the distribution proper? Transform to determine \(p(N, \theta)\).
The above prior distribution, \(p(\lambda, \theta) = \frac{1}{\lambda}\), is a standard non-informative prior distribution that is uniform on \(log(\lambda)\). The distribution is clearly improper (since it does not integrate to 1).
To find the prior distribution \(p(N, \theta)\), we first need to find \(p(\mu, \theta)\) by applying the change of variable technique:
\[\mu = \mu(\lambda, \theta) = \frac{\lambda}{\theta}; \qquad \theta = \theta(\lambda, \theta) = \theta\] \[\lambda = \lambda(\mu, \theta) = \mu \theta; \qquad \theta = \theta(\mu, \theta) = \theta\] \[\lvert D \lvert = \begin{vmatrix} \theta & \mu \\ 0 & 1 \end{vmatrix} = \theta\] \[p(\theta, \mu) = p_{\lambda, \theta}(\lambda(\mu, \theta), \theta(\mu, \theta)) \lvert D \lvert = \frac{1}{\mu \theta} \cdot \theta = \frac{1}{\mu}\]
Hence, by independence: \(p(\mu) = \frac{1}{\mu}\). Now to find the prior \(p(N, \mu)\):
\[p(N, \mu) = p(N \lvert \mu)p(\mu) = \frac{\mu^N e^{- \mu}}{N!} \cdot \frac{1}{\mu}\] \[= \frac{\mu^{N - 1} e^{-\mu}}{N!}\]
Finally, to get the prior on \(N\), we need to integrate across \(\mu\):
\[p(N) = \int_{0}^\infty \frac{\mu^{N - 1} e^{-\mu}}{N!} d \mu\] \[= \frac{1}{N!} \int_0^\infty \mu^{N - 1} e^{-\mu} d\mu\] \[= \frac{1}{N!} \cdot \Gamma(N)\] \[= \frac{(N - 1)!}{N!} = \frac{1}{N}\]
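The result \(p(N) = \frac{1}{N}\) can be spot-checked by numerical integration for a few values of \(N\):

```r
# p(N): integrate mu^(N-1) * exp(-mu) / N! over mu; should equal 1/N
pN <- function(N)
  integrate(function(m) m^(N - 1) * exp(-m) / factorial(N), 0, Inf)$value
rbind(sapply(2:6, pN), 1 / (2:6))
```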
Hence the prior for \((N, \theta)\) will be:
\[p(N, \theta) = \frac{1}{N}\] And the posterior will be:
\[p(N, \theta \lvert y) \propto \frac{1}{N} \cdot \prod_{i=1}^n {N \choose y_i} \theta^{ y_i} (1 - \theta)^{N - y_i}\]
And the posterior for \(N\) is:
\[p(N \lvert y) \propto \frac{1}{N} \Bigg[ \prod_{i=1}^n {N \choose y_i} \Bigg] \int_0^1 \theta^{\sum_{i=1}^n y_i} (1 - \theta)^{nN - \sum_{i=1}^n y_i} d\theta\] \[= \frac{1}{N} \Bigg[ \prod_{i=1}^n {N \choose y_i} \Bigg] \cdot \frac{\Gamma(1 + \sum_{i=1}^n y_i) \Gamma(1 + nN - \sum_{i=1}^n y_i)}{\Gamma(2 + nN)}\] \[= \frac{(\sum_{i=1}^n y_i)! \, (nN - \sum_{i=1}^n y_i)!}{N \, (nN + 1)!} \Bigg[ \prod_{i=1}^n {N \choose y_i} \Bigg]\]
Counts of waterbuck were observed on five separate days in the Kruger Park in South Africa: \(y = (53, 57, 66, 67, 72)\). Obtain the posterior for \((N, \theta)\) and display a scatterplot of posterior simulations. What is the posterior probability of \(N > 100\)?
y <- c(53, 57, 66, 67, 72)
N <- max(y):1000
theta <- seq(0.01, 0.99, 0.01)
# Log-posterior over a grid of (N, theta) values
lpost <- matrix(nrow = length(N), ncol = length(theta))
sum_y <- sum(y)
# Calculate the log of the posterior density across the grid
for (i in seq_len(nrow(lpost))) {
for (j in seq_len(ncol(lpost))) {
lpost[i, j] <- -log(N[i]) + sum(lchoose(N[i], y)) +
sum_y * log(theta[j]) +
(5 * N[i] - sum_y) * log(1 - theta[j])
}
}
contour(N, theta, exp(lpost - max(lpost)), drawlabels = FALSE,
col = 'steelblue', axes = FALSE,
xlab = 'N', ylab = expression(theta))
axis(1, tick = FALSE)
axis(2, las = 1, tick = FALSE)
box(bty = 'L', col = 'grey60')
# Marginal posterior density for N
post_N <- rowSums(exp(lpost - max(lpost)))
post_N <- post_N / sum(post_N)
plot(N, post_N, type = 'l', col = 'steelblue',
lwd = 2, axes = FALSE,
xlab = 'N', ylab = 'Posterior density')
axis(1, tick = FALSE)
box(bty = 'L', col = 'grey60')
sum(post_N[N > 100])
## [1] 0.9518627
There is about 95% posterior probability that \(N\) is greater than 100.
Why not simply use a Poisson with fixed \(\mu\) as a prior distribution for N?
A \(\text{Poisson}(\mu)\) prior with a fixed \(\mu\) would be strongly informative about \(N\): it would pin the expected population size to a number we claim to know in advance, and the posterior would be sensitive to that particular choice of \(\mu\). The noninformative \(p(N) \propto \frac{1}{N}\) prior avoids committing to any such value.
A student sits on a street corner for an hour and records the number of bicycles \(b\) and the number of other vehicles \(v\). Two models are considered: under the first, \(b\) and \(v\) are independent \(\text{Poisson}(\theta_b)\) and \(\text{Poisson}(\theta_v)\) counts; under the second, \(b \sim \text{Binomial}(b + v, p)\).
Show that the models have the same likelihood if we define \(p = \frac{\theta_b}{\theta_b + \theta_v}\):
\[p(b, v \lvert \theta_b, \theta_v) = \frac{\theta_b^b e^{-\theta_b}}{b!} \cdot \frac{\theta_v^v e^{-\theta_v}}{v!} = \frac{\theta_b^b \theta_v^v e^{- \theta_b - \theta_v}}{b! v!}\]
Now apply change of variable:
\[p = p(\theta_b, \theta_v) = \frac{\theta_b}{\theta_b + \theta_v}; \qquad z = z(\theta_b, \theta_v) = \theta_b + \theta_v\] \[\theta_b = \theta_b (p, z) = pz; \qquad \theta_v = \theta_v (p, z) = z - pz = z(1 - p)\] Substituting into the likelihood (no Jacobian is needed here, since we are reparametrizing the likelihood rather than transforming a density over the parameters): \[p(b, v \lvert p, z) = \frac{(pz)^b [z(1 - p)]^v e^{-zp -z(1 - p)}}{b! v!}\] \[= \frac{1}{b! v!} p^b (1 - p)^v z^{b + v} e^{-z}\]
Now we integrate across \(z\):
\[p(b \lvert p, v) = \frac{1}{b! v!} p^b (1 - p)^v \int_{0}^\infty z^{b + v} e^{-z} dz\] \[= \frac{1}{b! v!} p^b (1 - p)^v \cdot \Gamma (b + v + 1)\] \[= \frac{(b + v)!}{b! v!} p^b (1 - p)^v = {b + v \choose b} p^b (1 - p)^v\] which is exactly the \(\text{Binomial}(b + v, p)\) likelihood.
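A closely related numerical check: under the two independent Poisson counts, the conditional distribution of \(b\) given the total \(b + v\) is exactly \(\text{Binomial}(b + v, p)\). The rates below are arbitrary:

```r
tb <- 3.2; tv <- 5.7              # arbitrary Poisson rates
n <- 12; b <- 0:n
p <- tb / (tb + tv)
# P(b | b + v = n) under independent Poissons
cond <- dpois(b, tb) * dpois(n - b, tv) / dpois(n, tb + tv)
max(abs(cond - dbinom(b, n, p)))  # agrees to numerical precision
```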
A survey was done in which sixty city blocks were selected at random, each block was observed for an hour, and the numbers of bicycles and other vehicles were recorded. Data from streets with and without bike routes were compared.
w_lanes_bic <- c(16, 9, 10, 13, 19, 20, 18, 17, 35, 55)
w_lanes_all <- c(58, 90, 48, 57, 103, 57, 86, 112, 273, 64)
wo_lanes_bic <- c(12, 1, 2, 4, 9, 7, 9, 8)
wo_lanes_all <- c(113, 18, 14, 44, 208, 67, 29, 154)
Let \(y_1, y_2, \ldots, y_{10}\) and \(z_1, z_2, \ldots, z_{8}\) be the observed numbers of bicycles, out of the overall traffic counts \(a_i\) and \(b_i\), on streets with and without bike lanes. Set up a model in which the \(y_i\)’s and \(z_i\)’s are independent, with parameters \(\theta_y\) and \(\theta_z\):
\[y_i \sim \text{Poisson}(a_i \theta_y)\] \[z_i \sim \text{Poisson}(b_i \theta_z)\]
\[\implies p(y_i \lvert \theta_y, a_i) = \frac{(a_i \theta_y)^{y_i} e^{- a_i \theta_y}}{y_i!}; \qquad p(z_i \lvert \theta_z, b_i) = \frac{(b_i \theta_z )^{z_i} e^{- b_i \theta_z}}{z_i!}\]
Set up a prior distribution that is independent in \(\theta_y\) and \(\theta_z\):
\[p(\theta_y, \theta_z) = p(\theta_y) \cdot p(\theta_z) = \frac{1}{\theta_y} \cdot \frac{1}{\theta_z}\]
Determine the posterior and draw 1,000 simulations from the posterior distribution:
\[p(\theta_y, \theta_z \lvert y, z, a, b) = p(\theta_y \lvert y, a) \cdot p(\theta_z \lvert z, b)\] \[\propto \frac{1}{\theta_y} \cdot \prod_{i=1}^{10} \frac{(a_i \theta_y)^{y_i} e^{- a_i \theta_y}}{y_i!} \cdot \frac{1}{\theta_z} \prod_{i=1}^8 \frac{(b_i \theta_z)^{z_i} e^{-b_i \theta_z}}{z_i!}\] \[\propto \bigg( \theta_y^{\sum_{i=1}^{10} y_i -1} e^{- \theta_y \sum_{i=1}^{10} a_i} \bigg) \bigg( \theta_z^{\sum_{i=1}^8 z_i - 1} e^{- \theta_z \sum_{i=1}^8 b_i} \bigg)\]
theta_y <- rgamma(1e3, sum(w_lanes_bic), sum(w_lanes_all))
theta_z <- rgamma(1e3, sum(wo_lanes_bic), sum(wo_lanes_all))
Let \(\mu_y = E(y_i \lvert \theta_y)\) be the mean of the distribution of \(y_i\)’s and \(\mu_z = E(z_i \lvert \theta_z)\) be the mean of the distribution of \(z_i\)’s. Use the posterior simulations from (c) to compute \(\mu_y\) and \(\mu_z\) and plot a histogram of \(\mu_y - \mu_z\).
mu_y <- sapply(theta_y, function(x)
mean(rpois(10, x * w_lanes_all)))
mu_z <- sapply(theta_z, function(x)
mean(rpois(8, x * wo_lanes_all)))
hist(mu_y - mu_z, main = NA, breaks = 20,
axes = FALSE, col = 'steelblue', border = 'white',
xlab = expression(mu[y] - mu[z]), ylab = 'Count')
axis(1, tick = FALSE)
axis(2, tick = FALSE, las = 1)
box(bty = 'L', col = 'grey60')
Let \(y_i \sim \text{Normal}(\mu, \sigma^2)\) and the prior distribution on \((\mu, \sigma^2)\) be \(\text{N-Inv-}\chi^2 (\mu_0,\sigma_0^2 / \kappa_0; \nu_0, \sigma^2_0)\). The posterior for \((\mu, \sigma^2)\) is also Normal-Inverse-\(\chi^2\): derive its parameters in terms of prior parameters and sufficient statistics.
\[p(\mu, \sigma^2) \propto \sigma^{-1} exp \bigg(-\frac{\kappa_0}{2 \sigma^2}(\mu - \mu_0)^2 \bigg) \cdot (\sigma^2)^{- (\nu_0/2 + 1)} exp \bigg(-\frac{1}{2 \sigma^2} (\nu_0 \sigma_0^2) \bigg)\] \[= \sigma^{-1} (\sigma^2)^{- (\nu_0 /2 + 1)} exp \bigg(-\frac{1}{2\sigma^2} [\nu_0 \sigma_0^2 + \kappa_0 (\mu_0 - \mu)^2] \bigg) \]
\[p(\mu, \sigma^2 \lvert y) \propto p(y \lvert \mu, \sigma^2) p(\mu, \sigma^2)\] \[\propto (\sigma^2)^{-\frac{n}{2}} exp \bigg(-\frac{(n - 1)s^2 + n(\mu - \bar y)^2}{2\sigma^2} \bigg) \cdot \sigma^{-1} (\sigma^2)^{- (\nu_0 /2 + 1)} exp \bigg(-\frac{1}{2\sigma^2} [\nu_0 \sigma_0^2 + \kappa_0 (\mu_0 - \mu)^2] \bigg) \] \[= \sigma^{-1}(\sigma^2)^{-\frac{n + \nu_0}{2} - 1} exp \bigg( -\frac{\nu_0 \sigma_0^2 + (n - 1)s^2 + \mu^2(n + \kappa_0) - 2\mu(n \bar y + \kappa_0 \mu_0) + n \bar y^2 + \kappa_0 \mu_0^2 }{2 \sigma^2} \bigg)\] \[= \sigma^{-1}(\sigma^2)^{-\frac{n + \nu_0}{2} - 1} exp \bigg( -\frac{\nu_0 \sigma_0^2 + (n - 1)s^2 + \frac{n \kappa_0 (\bar y - \mu_0)^2}{n + \kappa_0} + (n + \kappa_0)(\mu - \frac{n \bar y + \kappa_0 \mu_0}{n + \kappa_0})^2}{2 \sigma^2} \bigg) \qquad \text{(completing the square)}\] Thus:
\[\mu, \sigma^2 \lvert y \sim \text{N-Inv-}\chi^2 \bigg( \frac{n \bar y + \kappa_0 \mu_0}{n + \kappa_0}, \frac{\sigma_n^2}{n + \kappa_0}; n + \nu_0, \sigma^2_n \bigg)\] where:
\[\sigma^2_n = \frac{\nu_0 \sigma_0^2 + (n-1)s^2 + \frac{n \kappa_0 (\bar y - \mu_0)^2}{n + \kappa_0}}{n + \nu_0}\]
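The completing-the-square identity can be verified numerically; all the values below are made up purely for the check:

```r
# Check: the quadratic-in-mu expansion equals the completed-square form
mu <- 1.3; mu0 <- 0.4; k0 <- 2.5; nu0 <- 3; s0sq <- 0.8
n <- 7; ybar <- 1.1; s2 <- 0.6
lhs <- nu0 * s0sq + (n - 1) * s2 + mu^2 * (n + k0) -
  2 * mu * (n * ybar + k0 * mu0) + n * ybar^2 + k0 * mu0^2
rhs <- nu0 * s0sq + (n - 1) * s2 + n * k0 * (ybar - mu0)^2 / (n + k0) +
  (n + k0) * (mu - (n * ybar + k0 * mu0) / (n + k0))^2
all.equal(lhs, rhs)
```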
We observe 5 independent observations from a Cauchy distribution with unknown location \(\theta\) and known scale 1: \((y_1, \ldots, y_5) = (-2, -1, 0, 1.5, 2.5)\).
Determine the first and second derivatives of the log-posterior density:
\[p(\mathbf{y} \lvert \theta) = \prod_{i=1}^5 \frac{1}{1 + (y_i - \theta)^2}\]
\[log(p(\mathbf{y} \lvert \theta)) = \sum_{i=1}^5 log \bigg( \frac{1}{1 + (y_i - \theta)^2} \bigg) = -\sum_{i=1}^5 log(1 + (y_i - \theta)^2)\]
\[\frac{\partial}{\partial \theta} log(p(\mathbf{y} \lvert \theta)) = -\sum_{i=1}^5 \frac{\partial}{\partial \theta} log(1 + (y_i - \theta)^2)\] \[= - \sum_{i=1}^5 \frac{1}{1 + (y_i - \theta)^2} \cdot 2(y_i - \theta) \cdot(-1)\] \[= 2 \sum_{i=1}^5 \frac{y_i - \theta}{1 + (y_i - \theta)^2}\]
\[\frac{\partial^2}{\partial \theta^2} log(p(\mathbf{y} \lvert \theta)) = 2 \sum_{i=1}^5 \frac{\partial}{\partial \theta} \frac{y_i - \theta}{1 + (y_i - \theta)^2}\] \[= 2 \sum_{i=1}^5 \frac{(-1)[1 + (y_i - \theta)^2] - 2 \cdot (y_i - \theta) \cdot (-1) \cdot (y_i - \theta)}{[1 + (y_i - \theta)^2]^2}\] \[= 2 \sum_{i=1}^5 \frac{(y_i - \theta)^2 - 1}{[1 + (y_i - \theta)^2]^2}\]
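Both derivatives can be spot-checked against finite differences of the log-posterior (evaluated at an arbitrary point \(\theta = 0.3\)):

```r
y <- c(-2, -1, 0, 1.5, 2.5)
lp <- function(th) -sum(log(1 + (y - th)^2))  # log-posterior, up to a constant
d1 <- function(th) 2 * sum((y - th) / (1 + (y - th)^2))
d2 <- function(th) 2 * sum(((y - th)^2 - 1) / (1 + (y - th)^2)^2)
h <- 1e-4; t0 <- 0.3
c(d1(t0), (lp(t0 + h) - lp(t0 - h)) / (2 * h))            # first derivative
c(d2(t0), (lp(t0 + h) - 2 * lp(t0) + lp(t0 - h)) / h^2)   # second derivative
```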
To find the posterior mode, we can solve the score equation (setting the first derivative to zero) numerically:
y <- c(-2, -1, 0, 1.5, 2.5)
# Score function: the first derivative of the log-posterior
# (the prior on theta is taken to be flat)
scorefun <- function(theta) {
2 * sum((y - theta) / (1 + (y - theta)^2))
}
# The score changes sign between -1 and 1
mode <- uniroot(scorefun, c(-1, 1))$root
Calculate the normal approximation:
# Observed information: minus the second derivative, evaluated at the mode
I <- -2 * sum(((y - mode)^2 - 1) / (1 + (y - mode)^2)^2)
var_theta <- 1 / I
theta <- seq(-2, 2, 0.01)
post_true <- sapply(theta, function(x) exp(-sum(log(1 + (y - x)^2))))
post_true <- post_true / sum(post_true)
post_approx <- dnorm(theta, mode, sqrt(var_theta))
post_approx <- post_approx / sum(post_approx)
plot(theta, post_approx, type = 'l',
axes = FALSE, lwd = 2, col = 'firebrick',
xlab = expression(theta), ylab = 'Density')
lines(theta, post_true, lwd = 2, col = 'steelblue', type = 'l')
axis(1, tick = FALSE)
box(bty = 'L', col = 'grey60')
We have four observations from four independent experiments:
\[\mathbf{y} = (0, 1, 3, 5)\] \[\mathbf{n} = (5, 5, 5, 5)\] \[\mathbf{x} = (-0.86, -0.3, -0.05, 0.73)\]
\[y_i \lvert \theta_i \sim \text{Binomial}(n_i, \theta_i)\] \[log \bigg( \frac{\theta_i}{1 - \theta_i} \bigg) = \alpha + \beta x_i \implies \theta_i = \frac{e^{\alpha + \beta x_i}}{1 + e^{\alpha + \beta x_i}}\]
The likelihood is:
\[p(\mathbf{y} \lvert \alpha, \beta) = \prod_{i=1}^4 {5 \choose y_i} \bigg( \frac{e^{\alpha + \beta x_i}}{1 + e^{\alpha + \beta x_i}} \bigg)^{y_i} \bigg( 1 - \frac{e^{\alpha + \beta x_i}}{1 + e^{\alpha + \beta x_i}} \bigg)^{5 - y_i}\] \[= \prod_{i=1}^4 {5 \choose y_i} \bigg( \frac{e^{\alpha + \beta x_i}}{1 + e^{\alpha + \beta x_i}} \bigg)^{y_i} \bigg( \frac{1}{1 + e^{\alpha + \beta x_i}} \bigg)^{5 - y_i}\]
\[log(p(\mathbf{y} \lvert \alpha, \beta)) \propto \sum_{i=1}^4 y_i \cdot (\alpha + \beta x_i) - y_i \cdot log(1 + e^{\alpha + \beta x_i}) - 5 \cdot log(1 + e^{\alpha + \beta x_i}) + y_i \cdot log(1 + e^{\alpha + \beta x_i})\] \[= \sum_{i=1}^4 y_i \cdot (\alpha + \beta x_i) - 5 \cdot log(1 + e^{\alpha + \beta x_i})\]
\[\frac{\partial}{\partial \alpha} log(p(y_i \lvert \alpha, \beta)) = y_i - 5 \cdot \frac{e^{\alpha + \beta x_i}}{1 + e^{\alpha + \beta x_i}}\] \[\frac{\partial}{\partial \beta} log(p(y_i \lvert \alpha, \beta)) =y_i x_i - 5 \cdot \frac{x_i \cdot e^{\alpha + \beta x_i}}{1 + e^{\alpha + \beta x_i}}\]
\[\frac{\partial^2}{\partial \alpha^2} log(p(y_i \lvert \alpha, \beta)) = - 5 \cdot \frac{e^{\alpha + \beta x_i}(1 + e^{\alpha + \beta x_i}) - e^{\alpha + \beta x_i} \cdot e^{\alpha + \beta x_i}}{(1 + e^{\alpha + \beta x_i})^2}\] \[= - \frac{5e^{\alpha + \beta x_i}}{(1 + e^{\alpha + \beta x_i})^2}\]
\[\frac{\partial^2}{\partial \alpha \partial \beta} log(p(y_i \lvert \alpha, \beta)) = - 5 \cdot \frac{x_i e^{\alpha + \beta x_i} (1 + e^{\alpha + \beta x_i}) - x_i e^{\alpha + \beta x_i} e^{\alpha + \beta x_i}}{(1 + e^{\alpha + \beta x_i})^2}\] \[= -\frac{5x_i e^{\alpha + \beta x_i}}{(1 + e^{\alpha + \beta x_i})^2}\]
\[\frac{\partial^2}{\partial \beta^2} log(p(y_i \lvert \alpha, \beta)) = - 5 \cdot \frac{x_i^2e^{\alpha + \beta x_i}(1 + e^{\alpha + \beta x_i}) - x_ie^{\alpha + \beta x_i} \cdot x_ie^{\alpha + \beta x_i}}{(1 + e^{\alpha + \beta x_i})^2}\] \[= -\frac{5x_i^2 e^{\alpha + \beta x_i}}{(1 + e^{\alpha + \beta x_i})^2}\]
Therefore, the matrix of second derivatives of the log-posterior (its Hessian) is:
\[H(\hat \theta) = \begin{bmatrix} -\sum_{i=1}^n \frac{5e^{\alpha + \beta x_i}}{(1 + e^{\alpha + \beta x_i})^2} & -\sum_{i=1}^n \frac{5x_ie^{\alpha + \beta x_i}}{(1 + e^{\alpha + \beta x_i})^2} \\ -\sum_{i=1}^n \frac{5x_ie^{\alpha + \beta x_i}}{(1 + e^{\alpha + \beta x_i})^2} & -\sum_{i=1}^n \frac{5x_i^2e^{\alpha + \beta x_i}}{(1 + e^{\alpha + \beta x_i})^2} \end{bmatrix}\] and the observed information is \(I(\hat \theta) = -H(\hat \theta)\), so the normal approximation uses \(-H(\hat \theta)^{-1}\) as its covariance.
To find the posterior mode, we can use numerical optimization:
y <- c(0, 1, 3, 5)
x <- c(-0.86, -0.3, -0.05, 0.73)
# Negative log-likelihood (the prior on (alpha, beta) is flat)
llfun <- function(theta) {
a <- theta[1]; b <- theta[2]
-sum(y * (a + b * x) - 5 * log(1 + exp(a + b * x)))
}
mode <- optim(c(1, 1), llfun)$par
a <- mode[1]
b <- mode[2]
# Second derivatives evaluated at the mode
plpa2 <- -sum( (5 * exp(a + b * x)) / (1 + exp(a + b * x))^2)
plpab <- -sum( (5 * x * exp(a + b * x)) / (1 + exp(a + b * x))^2)
plpb2 <- -sum( (5 * x^2 * exp(a + b * x)) / (1 + exp(a + b * x))^2)
I <- matrix(c(plpa2, plpab, plpab, plpb2), ncol = 2) # Hessian at the mode
var_theta <- -solve(I) # covariance = inverse of the observed information, -I
# Draw samples from the approximate posterior
draws_approx <- MASS::mvrnorm(2e3, mode, var_theta)
# Compute the true posterior across a grid of values
anew <- seq(-3, 8, 0.025)
bnew <- seq(-10, 35, 0.1)
post_true <- matrix(nrow = length(anew),
ncol = length(bnew))
for (i in seq_along(anew)) {
for (j in seq_along(bnew)) {
ps <- exp(anew[i] + bnew[j] * x) / (1 + exp(anew[i] + bnew[j] * x))
post_true[i, j] <- prod(dbinom(y, 5, ps))
}
}
contour(anew, bnew, post_true, col = 'firebrick',
axes = FALSE, drawlabels = FALSE,
xlab = expression(alpha), ylab = expression(beta))
points(draws_approx[, 1], draws_approx[, 2],
pch = 19, col = colt('steelblue', 0.05))
axis(1, tick = FALSE)
axis(2, tick = FALSE, las = 1)
box(bty = 'L', col = 'grey60')
legend('topright', col = c('firebrick', 'steelblue'),
legend = c('True posterior', 'Normal approximation (samples)'),
lwd = 1, border = 'grey60')
From the previous question, we have:
# Posterior mode
mode
## [1] 0.846716 7.750138
# Posterior variance
var_theta
## [,1] [,2]
## [1,] 1.038716 3.547155
## [2,] 3.547155 23.753307
Let \(\mathbf{\theta}\) be the parameter vector \((\alpha, \beta)\). Now the function we want to approximate is LD50:
\[\text{LD50} = g(\mathbf{\theta}) = - \frac{\alpha}{\beta}\] By the invariance of MLE’s, we know that:
\[\text{MLE}(g(\mathbf{\theta})) = g( \mathbf{\hat \theta}) = -\frac{\hat \alpha}{\hat \beta}\] Now, as for the Delta method, we can expand \(g(\mathbf{\theta})\) as a Taylor series around the posterior mode:
\[g(\mathbf{\hat \theta}) \approx g(\mathbf{\theta}) + \nabla g(\mathbf{\theta})^T( \mathbf{\hat \theta} - \mathbf{\theta})\] where \(g(\mathbf{\theta})\) is the transformation evaluated at the true parameter, and \(\nabla g(\mathbf{\theta})\) is its gradient. We can use this to obtain the variance of the transformed parameter:
\[Var(g(\mathbf{\hat \theta})) \approx Var(g(\mathbf{\theta}) + \nabla g(\mathbf{\theta})^T( \mathbf{\hat \theta} - \mathbf{\theta}))\] \[= Var(g(\mathbf{\theta}) + \nabla g(\mathbf{\theta})^T \mathbf{\hat \theta} - \nabla g(\mathbf{\theta})^T \mathbf{\theta})\] \[= Var(\nabla g(\mathbf{\theta})^T \mathbf{\hat \theta}) \qquad \text{(since } g(\mathbf{\theta}) \text{ is a constant)}\] \[= \nabla g(\mathbf{\theta})^T \Sigma_{\mathbf{\hat \theta}} \nabla g(\mathbf{\theta})\] where \(\Sigma_{\mathbf{\hat \theta}}\) is the posterior covariance matrix of \(\mathbf{\hat \theta}\). Additionally:
\[\nabla g(\mathbf{\theta}) = \begin{bmatrix} \frac{\partial}{\partial \alpha} g(\mathbf{\theta}) \\ \frac{\partial}{\partial \beta} g(\mathbf{\theta}) \end{bmatrix} = \begin{bmatrix} -\frac{1}{\beta} \\ \frac{\alpha}{\beta^2} \end{bmatrix} \]
We approximate this with the MLE values:
\[\nabla g(\mathbf{\hat \theta}) = \begin{bmatrix} -\frac{1}{\beta} \\ \frac{\alpha}{\beta^2} \end{bmatrix} \approx \begin{bmatrix} -\frac{1}{7.75} \\ \frac{0.85}{(7.75)^2} \end{bmatrix}\] Therefore:
\[Var(g(\mathbf{\hat \theta})) \approx \begin{bmatrix} -\frac{1}{7.75} & \frac{0.85}{(7.75)^2} \end{bmatrix} \begin{bmatrix} 1.04 & 3.55 \\ 3.55 & 23.75 \end{bmatrix} \begin{bmatrix} -\frac{1}{7.75} \\ \frac{0.85}{(7.75)^2} \end{bmatrix}\]
Compute this numerically in R:
jacob <- c(-1/mode[2], mode[1] / mode[2]^2)
mode_ld50 <- - mode[1] / mode[2]
var_ld50 <- as.numeric(t(jacob) %*% var_theta %*% jacob)
(waldci_ld50 <- mode_ld50 + c(-1, 0, 1) * 1.96 * sqrt( var_ld50))
## [1] -0.2963231 -0.1092517 0.0778196
The Wald confidence interval seems to roughly correspond to the histogram on page 87.
Assuming regularity conditions, we know that \(p(\theta \lvert y)\) approaches normality as \(n \to \infty\). In addition, if \(\phi= f(\theta)\) is any one-to-one continuous transformation of \(\theta\), we can express the Bayesian inference in terms of \(\phi\) and find that \(p(\phi \lvert y)\) also approaches normality. But a non-linear transformation is no longer normal. How can both limiting normal distributions be valid?
As \(n \to \infty\), posterior variance approaches zero, and so \(p(\theta \lvert y)\) will converge to a single point. Any one-to-one continuous transformation will be locally linear in the neighbourhood of that point.
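This local-linearity argument can be illustrated with a quick simulation (my own sketch, not from the book): take a "posterior" for \(\theta\) that concentrates around 2, apply the nonlinear one-to-one transformation \(\phi = 1/\theta\), and watch the skewness of \(\phi\) vanish as the posterior tightens.

```r
# Sketch: as the posterior for theta concentrates, the distribution of the
# nonlinear transform phi = 1/theta becomes increasingly symmetric (normal).
set.seed(1)
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3
for (s in c(0.3, 0.1, 0.01)) {
  theta <- rnorm(1e5, 2, s)  # "posterior" draws, concentrating around 2
  phi <- 1 / theta           # one-to-one continuous transformation
  cat('posterior sd:', s, ' skewness of phi:', round(skewness(phi), 3), '\n')
}
```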
Let \(X\) and \(Y\) be independent normally distributed random variables where \(X\) has mean 4 and standard deviation 1, and \(Y\) has mean 3 and standard deviation 2. What is the approximate mean and variance of \(\frac{Y}{X}\)?
Compute using simulation:
set.seed(1234)
x <- rnorm(1e4, 4, 1)
y <- rnorm(1e4, 3, 2)
z <- y/x
hist(z, axes = FALSE, main = NULL,
col = 'steelblue', border = 'white',
xlab = expression(y/x), ylab = 'Count')
axis(1, tick = FALSE)
axis(2, tick = FALSE, las = 1)
box(bty = 'L', col = 'grey60')
mean(z)
## [1] 0.8003443
var(z)
## [1] 0.3873014
Determine without using simulation.
To approximate the ratio, we have to use a (bivariate) Taylor series approximation. We have the function:
\[f(x, y) = \frac{Y}{X}\]
The derivatives of this function with respect to \(X\) and \(Y\) are:
\[f_x' = -\frac{Y}{X^2}; \qquad f_{xx}'' = \frac{2Y}{X^3}\] \[f_{xy}'' = f_{yx}'' = -\frac{1}{X^2}\] \[f_y' = \frac{1}{X}; \qquad f_{yy}'' = 0\] A first order Taylor series approximation of \(f(x, y)\) about the joint mean/centroid \(\mathbf{\mu} = (\mu_X, \mu_Y)\) is:
\[f(x, y) \approx f(\mathbf{\mu}) + f'_x(x - \mu_X) + f_y'(y - \mu_Y)\] If we take an expectation of this, we get:
\[E[f(x, y)] \approx E[f(\mathbf{\mu}) + f'_x(\mathbf{\mu})(x - \mu_X) + f_y'(\mathbf{\mu})(y - \mu_Y)]\] \[= E[f(\mathbf{\mu})] + f'_x(\mathbf{\mu}) E[(x - \mu_X)] + f_y'(\mathbf{\mu}) E[(y - \mu_Y)]\] \[= E[f(\mathbf{\mu})] + 0 + 0\] \[= f(\mathbf{\mu}) = \frac{\mu_Y}{\mu_X}\]
However, we can do better. A second order Taylor series approximation of the function \(f(x, y)\) is:
\[f(x, y) \approx f(\mathbf{\mu}) + f'_x(\mathbf{\mu})(x - \mu_X) + f_y' (\mathbf{\mu})(y - \mu_Y) + \frac{1}{2} \bigg[ f_{xx}''(\mathbf{\mu}) (x - \mu_X)^2 + 2 f_{xy}''(\mathbf{\mu}) (x - \mu_X)(y - \mu_Y) + f_{yy}''(\mathbf{\mu})(y - \mu_Y)^2 \bigg]\]
Again, taking the expectation, we get:
\[E[f(x, y)] \approx E[f(\mathbf{\mu})] + 0 + 0 + \frac{1}{2} \cdot \frac{2\mu_Y}{\mu_X^3} \cdot E[(x - \mu_X)^2] + \bigg( - \frac{1}{\mu_X^2} \bigg) \cdot E[(x - \mu_X)(y - \mu_Y)] + 0 \cdot E[(y - \mu_Y)^2]\] \[= \frac{\mu_Y}{\mu_X} + \frac{\mu_Y}{\mu_X^3} Var(X) - \frac{1}{\mu_X^2} Cov(X, Y)\]
Since the two variables are independent, \(Cov(X, Y) = 0\) and so:
\[E \bigg( \frac{Y}{X} \bigg) = \frac{\mu_Y}{\mu_X} + \frac{\mu_Y}{\mu_X^3} Var(X) = \frac{3}{4} + \frac{3}{64} \cdot 1^2 \approx 0.797\]
Which is pretty close to the answer we got from the simulation above.
As for the variance, technically \(Var \bigg( \frac{Y}{X} \bigg) = \infty\), since the denominator \(X\) assigns nonzero probability to values close to zero, which it does because \(X\) is normally distributed. However, that shouldn’t bother us too much if the mean of \(X\) is far from zero and the standard deviation is small, meaning that we are unlikely to get those troublesome values near zero. Therefore, we can approximate the variance with the Delta method (i.e. a Taylor series shortcut):
\[Var \bigg( \frac{Y}{X} \bigg) \approx \mathbf{G}^T \mathbf{V} \mathbf{G}\]
where \(\mathbf{G}\) is the gradient (Jacobian) vector of the transformation:
\[\mathbf{G} = \begin{bmatrix} -\frac{Y}{X^2} \\ \frac{1}{X} \end{bmatrix} \approx \begin{bmatrix} -\frac{3}{16} \\ \frac{1}{4} \end{bmatrix}\] \[\implies Var \bigg( \frac{Y}{X} \bigg) \approx \begin{bmatrix} -\frac{3}{16} & \frac{1}{4} \end{bmatrix} \begin{bmatrix} 1 & 0 \\ 0 & 4 \end{bmatrix} \begin{bmatrix} -\frac{3}{16} \\ \frac{1}{4} \end{bmatrix} \approx 0.285\]
G <- c(-3/16, 1/4)
V <- matrix(c(1, 0, 0, 4), ncol = 2)
approx_mean <- 3/4 + 3/64
approx_var <- t(G) %*% V %*% G
znew <- seq(min(z), max(z), length.out = 100)
hist(z, axes = FALSE, main = NULL, freq = FALSE,
col = 'steelblue', border = 'white',
xlab = expression(y/x), ylab = 'Density', ylim = c(0, 0.75))
lines(znew, dnorm(znew, approx_mean, sqrt(approx_var)),
col = 'firebrick', lwd = 2)
axis(1, tick = FALSE)
axis(2, tick = FALSE, las = 1)
box(bty = 'L', col = 'grey60')
The Taylor approximation seems to work pretty well. The variance is slightly underestimated, as could be expected due to the non-linearity of the transformation.
A decision-theoretic approach to estimation of an unknown parameter \(\theta\) is to introduce a loss function \(L(\theta, a)\) which gives the cost of deciding that the parameter has value \(a\) when it is in fact equal to \(\theta\). The estimate can be chosen to minimize expected loss:
\[E(L(\theta, a)) = \int L(\theta, a) p(\theta \lvert y) d\theta\] This optimal choice for \(a\) is called the Bayes estimate for loss function \(L\). Show that:
If \(L(\theta, a) = (\theta - a)^2\) (squared error loss), then the posterior mean, \(E(\theta \lvert y)\), if it exists, is the unique Bayes estimate for \(\theta\).
\[\frac{\partial}{\partial a} \int L(\theta, a) p(\theta \lvert y) d \theta = \frac{\partial}{\partial a} \int (\theta - a)^2 p(\theta \lvert y) d \theta\] \[= \int \frac{\partial}{\partial a} (\theta - a)^2 p(\theta \lvert y) d \theta\] \[= \int 2(\theta - a) (-1) p(\theta \lvert y) d \theta\] \[= \int 2(a - \theta ) p(\theta \lvert y) d \theta\] \[= 2 \cdot \int a \cdot p(\theta \lvert y) d \theta - 2 \cdot \int \theta \cdot p (\theta \lvert y ) d \theta\] \[= 2a \cdot \int p(\theta \lvert y) d \theta - 2 E(\theta \lvert y) = 2a - 2E(\theta \lvert y)\] Clearly, this will be zero when \(a\) is the posterior mean \(E(\theta \lvert y)\).
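A quick numerical check of this (my own sketch, using an arbitrary Beta(2, 5) posterior): minimizing the expected squared-error loss over a grid of actions recovers the posterior mean, \(2/7\).

```r
# Expected squared-error loss over a grid of actions a; the minimizer should
# coincide with the posterior mean of the (arbitrary) Beta(2, 5) posterior.
theta <- seq(0.001, 0.999, 0.001)
post <- dbeta(theta, 2, 5)
post <- post / sum(post)                    # discretized posterior
a_grid <- seq(0, 1, 0.001)
exp_loss <- sapply(a_grid, function(a) sum((theta - a)^2 * post))
a_grid[which.min(exp_loss)]                 # close to 2/7
```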
If \(L(\theta, a) = \lvert \theta - a \rvert\) (absolute error loss), then the posterior median of \(\theta\) is the Bayes estimate of \(\theta\).
Using the answer derived in part (c) below, with:
\[k_0 = k_1 = 1\] \[\implies \frac{k_0}{k_0 + k_1} = \frac{1}{1 + 1} = \frac{1}{2}\]
Thus, the \(a\) that minimizes the loss function is the 50th percentile of the posterior cumulative distribution function, i.e. the median.
If \(k_0\) and \(k_1\) are non-negative numbers, both nonzero, and:
\[L(\theta, a) = \begin{cases} k_0(\theta - a) & \text{if } \theta \geq a \\ k_1(a - \theta) & \text{if } \theta < a \end{cases}\]
\[\frac{\partial}{\partial a} E(L(a \lvert y)) = \frac{\partial}{\partial a} \Bigg[ \int_a^\infty k_0 (\theta - a) p(\theta \lvert y) d \theta + \int_{-\infty}^a k_1(a - \theta) p(\theta \lvert y) d \theta \Bigg]\] \[= \int_a^\infty \frac{\partial}{\partial a} k_0 (\theta - a) p(\theta \lvert y) d \theta + \int_{-\infty}^a \frac{\partial}{\partial a} k_1(a - \theta) p(\theta \lvert y) d \theta\] \[= - k_0 \int_{a}^\infty p(\theta \lvert y) d \theta + k_1 \int_{-\infty}^ a p(\theta \lvert y) d \theta\] \[= k_1 \int_{-\infty }^ a p(\theta \lvert y) d \theta - k_0 \bigg[1 - \int_{-\infty}^a p(\theta \lvert y) d \theta \bigg]\] \[= (k_0 + k_1) \int_{-\infty}^a p(\theta \lvert y) d \theta - k_0\]
Now set to zero:
\[ (k_0 + k_1) \int_{-\infty}^a p(\theta \lvert y) d \theta = k_0\] \[\implies \int_{-\infty}^a p(\theta \lvert y) d \theta = \frac{k_0}{k_0 + k_1}\]
Therefore, \(a\) is the \(\frac{k_0}{k_0 + k_1}\)th quantile of the posterior distribution.
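The same grid trick checks this result numerically (again my own sketch with an arbitrary Beta(2, 5) posterior): with \(k_0 = 3\) and \(k_1 = 1\), the minimizer should sit at the \(k_0/(k_0 + k_1) = 0.75\) quantile.

```r
# Expected asymmetric linear loss; the minimizer should be the 0.75 quantile.
k0 <- 3; k1 <- 1
theta <- seq(0.001, 0.999, 0.001)
post <- dbeta(theta, 2, 5)
post <- post / sum(post)
a_grid <- seq(0, 1, 0.001)
exp_loss <- sapply(a_grid, function(a) {
  sum(ifelse(theta >= a, k0 * (theta - a), k1 * (a - theta)) * post)
})
c(minimizer = a_grid[which.min(exp_loss)],
  beta_quantile = qbeta(k0 / (k0 + k1), 2, 5))
```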
Show that the posterior mean, based on a proper prior distribution, cannot be an unbiased estimator except in degenerate problems.
Let the posterior mean be \(m(y) = E(\theta \lvert y)\). For it to be an unbiased estimator of \(\theta\), the following would have to hold: \(E( m(y) \lvert \theta) = \theta\). However, then we’d have:
\[E(\theta m(y) ) = E[E(\theta m(y) \lvert \theta)] = E[\theta \cdot E(m(y) \lvert \theta)] = E[\theta^2]\] \[E(\theta m(y) ) = E[E(\theta m(y) \lvert y)] = E[m(y) \cdot E( \theta \lvert y)] = E[m(y)^2]\] Now, the variance of the estimator will be:
\[E[(m(y) - \theta)^2] = E[m(y)^2] - 2 E[\theta m(y)] + E[\theta^2]\] Using the two identities above, \(E[\theta m(y)] = E[\theta^2] = E[m(y)^2]\), and so:
\[E[(m(y) - \theta)^2] = E[m(y)^2] - 2 E[m(y)^2] + E[m(y)^2] = 0\]
So if the posterior mean is an unbiased estimator, the expected squared difference between it and \(\theta\) has to be zero, i.e. \(m(y) = \theta\) with probability 1. This is only the case when we have either an infinite amount of data or some other degenerate setup. In all other scenarios, the posterior mean won’t be unbiased.
Let the height of mothers and daughters be normally distributed with known means of 160, equal variances, and correlation of 0.5. The posterior mean of \(\theta\) is:
\[E(\theta \lvert y) = 160 + 0.5 (y - 160)\] and given daughter’s height, mother’s height is distributed as:
\[E(y \lvert \theta) = 160 + 0.5(\theta - 160)\] So the expectation of the posterior mean is:
\[E[E(\theta \lvert y)] = E[160 + 0.5(y - 160)]\] \[= 160 + 0.5 \cdot [E(y) - 160]\] \[= 160 + 0.5 [160 + 0.5(\theta - 160) - 160]\] \[= 160 + 0.25(\theta - 160)\] As such, the expectation of the posterior mean is shrunk towards the grand mean of 160 and is therefore biased. In contrast, an estimate:
\[\hat \theta = 160 + 2(y - 160)\]
is unbiased since:
\[E[\hat \theta] = E[160 + 2(y - 160)]\] \[= 160 + 2 \cdot E[y] - 2 \cdot 160\] \[= 2 \cdot [160 + 0.5(\theta - 160)] - 160\] \[= 2 \cdot [80 + 0.5 \theta] - 160 = \theta\] However, even though unbiased, this estimate clearly doesn’t make sense since it estimates the parameter to be more extreme than the data: e.g. for a mother who’s 10 centimeters taller than average, it will estimate that the daughter is 20 centimeters taller than average (i.e. the exact opposite of regression to the mean). This shows a downside to frequentist estimation, where an unknown quantity has to be treated differently depending on whether it’s a “parameter” or a “prediction”.
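A small simulation makes the bias and its direction concrete; the conditional standard deviation of 1 below is an arbitrary choice of mine, since only the conditional means matter for the bias.

```r
# Fix the daughter's height theta and simulate mothers' heights y | theta;
# the posterior-mean estimate is shrunk towards 160, while 160 + 2(y - 160)
# is unbiased (conditional sd of 1 is an arbitrary illustrative choice).
set.seed(42)
theta <- 170                                   # true value: 10 cm above average
y <- rnorm(1e5, 160 + 0.5 * (theta - 160), 1)  # E(y | theta) = 165
mean(160 + 0.5 * (y - 160))  # about 162.5: biased towards the grand mean
mean(160 + 2 * (y - 160))    # about 170: unbiased, but an absurd estimator
```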
For each of the following three examples, answer: (i) Are observations \(y_1\) and \(y_2\) exchangeable? (ii) Are observations \(y_1\) and \(y_2\) independent? (iii) Can we act as if the observations are independent?
A box has one black ball and one white ball. We pick a ball \(y_1\) at random, put it back, and pick another ball \(y_2\) at random.
The observations are independent and therefore exchangeable. Picking one color for \(y_1\) does not influence the likelihood of picking another color for \(y_2\), and the order in which we pick the balls does not matter. We can happily act as if the observations are independent, since they are.
A box has one black ball and one white ball. We pick a ball \(y_1\) at random, we do not put it back, and pick another ball \(y_2\).
In this case, the observations are not independent, but they are still exchangeable. We can either pick the white ball first and then the black, or the black ball first and then the white. Both sequences have the same joint probability (1/2), so they are exchangeable. We cannot act as if the observations are independent, since the likelihood of \(y_2\) is conditional on (entirely determined by) \(y_1\).
A box has one million black balls and one million white balls. We pick a ball \(y_1\) at random, we do not put it back, and pick another ball \(y_2\).
In this case, the observations are again not independent (technically) but are exchangeable. If we pick e.g. a white ball as \(y_1\), this infinitesimally decreases our likelihood of picking a white ball again for \(y_2\) (and so the observations aren’t independent). However, if we draw two white balls, two black balls, or a black ball and a white ball, the order in which we do so does not affect the joint probability (so the observations are exchangeable). It is justifiable to treat the observations as independent, since there are so many balls to sample from that removing a single ball has a negligible effect on the probabilities for the next draw.
Reproduce the computations in Section 5.5 for the educational testing example. Use posterior simulations to estimate: (i) for each school \(j\), the probability that its coaching program is the best of the eight; and (ii) for each pair of schools \(j\) and \(k\), the probability that the coaching program in school \(j\) is better than that in school \(k\).
school <- LETTERS[1:8]
treatment_effect <- c(28, 8, -3, 7, -1, 1, 18, 12)
std_error <- c(15, 10, 16, 11, 9, 11, 10, 18)
pooled_mean <- sum(treatment_effect / std_error^2) / sum(1 / std_error^2)
pooled_var <- 1 / sum(1 / std_error^2)
# p(tau | y)
p_tau_given_y <- function(tau) {
prec <- sum(1 / (std_error^2 + tau^2))
mu_hat <- (sum(treatment_effect / (std_error^2 + tau^2)) / sum(1 / (std_error^2 + tau^2)))
(1 / sqrt(prec)) * prod((std_error^2 + tau^2)^(-1/2) *
exp(- (treatment_effect - mu_hat)^2 / (2 * (std_error^2 + tau^2))))
}
tau <- seq(0.01, 40, 0.01)
plot(tau, sapply(tau, p_tau_given_y), type = 'l',
lwd = 2, col = 'steelblue',
xlab = expression(tau), ylab = 'Density', axes = FALSE)
axis(1, tick = FALSE)
box(bty = 'L', col = 'grey60')
tau_prob <- sapply(tau, p_tau_given_y)
tau_prob <- tau_prob / sum(tau_prob)
tau_cdf <- cumsum(tau_prob)
# Inverse cdf sampling method
rtau <- function(n = 1) {
ps <- runif(n)
tau[sapply(ps, function(x) min(which(tau_cdf > x)))]
}
tau_samples <- rtau(1e4)
# Conditional posterior mean and precision of mu, given each sampled tau
mu_hat <- sapply(tau_samples, function(x) {
  sum(treatment_effect / (std_error^2 + x^2)) / sum(1 / (std_error^2 + x^2))
})
mu_prec <- sapply(tau_samples, function(x) {
  sum(1 / (std_error^2 + x^2))
})
mu_samples <- rnorm(1e4, mu_hat, 1 / sqrt(mu_prec))
theta_samples <- matrix(0, nrow = 1e4, ncol = 8)
for (i in 1:1e4) {
for (j in 1:8) {
theta_hat <- (treatment_effect[j] / std_error[j]^2 + mu_samples[i] / tau_samples[i]^2) /
(1 / std_error[j]^2 + 1 / tau_samples[i]^2)
theta_prec <- 1 / std_error[j]^2 + 1 / tau_samples[i]^2
theta_samples[i, j] <- rnorm(1, theta_hat, sqrt(1 / theta_prec))
}
}
# Posterior means
colMeans(theta_samples)
## [1] 11.832258 8.248265 6.448166 7.991295 5.595439 6.482169 11.087138
## [8] 8.976625
# Posterior standard deviations
apply(theta_samples, 2, sd)
## [1] 10.360701 8.553280 10.056311 8.798396 8.493761 8.970413 8.926155
## [8] 10.203950
pr_best <- table(apply(theta_samples, 1, which.max))
pr_best <- pr_best / sum(pr_best)
names(pr_best) <- school
pr_better_than <- matrix(0, nrow = 8, ncol = 8)
for (i in 1:8) {
for (j in 1:8) {
pr_better_than[i, j] <- mean(theta_samples[, i] > theta_samples[, j])
}
}
colnames(pr_better_than) <- school; rownames(pr_better_than) <- school
pr_best
## A B C D E F G H
## 0.2457 0.1012 0.0832 0.0991 0.0522 0.0702 0.2080 0.1404
round(pr_better_than, 2)
## A B C D E F G H
## A 0.00 0.64 0.69 0.64 0.73 0.70 0.52 0.61
## B 0.36 0.00 0.57 0.51 0.63 0.58 0.37 0.47
## C 0.31 0.43 0.00 0.44 0.54 0.50 0.32 0.41
## D 0.36 0.49 0.56 0.00 0.62 0.57 0.37 0.47
## E 0.27 0.37 0.46 0.38 0.00 0.46 0.27 0.36
## F 0.30 0.42 0.50 0.43 0.54 0.00 0.30 0.40
## G 0.48 0.63 0.68 0.63 0.73 0.70 0.00 0.59
## H 0.39 0.53 0.59 0.53 0.64 0.60 0.41 0.00
The probabilities I got were very similar to those in the worked out solution, although sometimes slightly different. This is probably due to Monte Carlo error (Gelman et al. used only 1,000 posterior samples, I used 10,000).
Repeat the same, but for the simpler model with \(\tau\) set to infinity.
If \(\tau\) is \(\infty\), the eight schools are independent, and so their respective posterior distributions are \(\theta_j \sim \text{Normal}(y_j, \sigma_j^2)\). Then:
\[P(\theta_i > \theta_j \lvert y) = P(\theta_i - \theta_j > 0 \lvert y) = P(\zeta > 0 \lvert y)\]
where \(\zeta\) is a normal variable with mean \(E(\zeta) = E(\theta_i) - E(\theta_j)\) and variance \(Var(\zeta) = Var(\theta_i) + Var(\theta_j) - 2 Cov(\theta_i, \theta_j) = \sigma_i^2 + \sigma_j^2 - 0\). The probability will therefore be:
\[P(\theta_i > \theta_j) = \Phi \Bigg( \frac{y_i - y_j}{\sqrt{\sigma_i^2 + \sigma_j^2}} \Bigg)\]
To estimate the probability of being the best, I simulated 10,000 draws from the independent distributions:
sims2 <- sapply(1:8, function(x) rnorm(1e4, mean = treatment_effect[x], sd = std_error[x]))
pr_best2 <- table(apply(sims2, 1, which.max))
pr_best2 <- pr_best2 / sum(pr_best2)
names(pr_best2) <- school
pr_better_than2 <- matrix(0, nrow = 8, ncol = 8)
for (i in 1:8) {
for (j in 1:8) {
pr_better_than2[i, j] <- pnorm((treatment_effect[i] - treatment_effect[j]) /
sqrt(std_error[i]^2 + std_error[j]^2))
}
}
colnames(pr_better_than2) <- school; rownames(pr_better_than2) <- school
pr_best2
## A B C D E F G H
## 0.5512 0.0329 0.0238 0.0367 0.0030 0.0135 0.1718 0.1671
round(pr_better_than2, 2)
## A B C D E F G H
## A 0.50 0.87 0.92 0.87 0.95 0.93 0.71 0.75
## B 0.13 0.50 0.72 0.53 0.75 0.68 0.24 0.42
## C 0.08 0.28 0.50 0.30 0.46 0.42 0.13 0.27
## D 0.13 0.47 0.70 0.50 0.71 0.65 0.23 0.41
## E 0.05 0.25 0.54 0.29 0.50 0.44 0.08 0.26
## F 0.07 0.32 0.58 0.35 0.56 0.50 0.13 0.30
## G 0.29 0.76 0.87 0.77 0.92 0.87 0.50 0.61
## H 0.25 0.58 0.73 0.59 0.74 0.70 0.39 0.50
Discuss how (a) and (b) differ.
The independent model in (b) places more confidence in the observed data & results in more extreme probabilities. For example, the probability of school A being the best and being higher than other schools is a lot higher than in (a). This is most likely not a desirable behavior, since it is less conservative.
If \(\tau = 0\), all the schools are the same, so we can’t make any comparisons between them.
Suppose it is known a priori that the \(2J\) parameters \(\theta_1, \theta_2, \ldots, \theta_{2J}\) are clustered into two groups, with exactly one half drawn from \(\text{Normal}(1, 1)\) and the other from \(\text{Normal}(-1, 1)\), but we do not know which distribution each parameter comes from.
Are \(\theta_1, \theta_2, \ldots \theta_{2J}\) exchangeable under this prior distribution?
Yes. Since the assignment of parameters to the two groups is symmetric in the indices, every permutation of the \(\theta\)’s has the same joint distribution.
Show that this distribution cannot be written as a mixture of independent and identically distributed components.
If the density could be written as a mixture of i.i.d. components, we could express it as:
\[p(\theta) = \int \prod_{j=1}^{2J} p(\theta_j \lvert \phi) p(\phi) d \phi\] That would require the mixing component \(p(\phi)\) to be independent as well (i.e. not depend on other \(\theta\)’s). However, since we know that exactly one half of the observations come from one distribution & one half from the other, the mixing isn’t independent - if, for example, we sample the first \(J\) \(\theta\)’s from \(\text{Normal}(1, 1)\), then we know that the remaining \(\theta\)’s will come from \(\text{Normal}(-1, 1)\), with probability 1. As such, there is a negative correlation between any pair of \(\theta\)’s, \(\theta_i\) and \(\theta_j\): if \(\theta_i\) is greater than zero, then it is more likely it comes from the \(\text{Normal}(1, 1)\) distribution, and in turn it is then more likely that \(\theta_j\) comes from the \(\text{Normal}(-1, 1)\) distribution, and vice versa.
Why can’t we simply take the limit as \(J \to \infty\) and get a counterexample to de Finetti’s theorem?
As \(J \to \infty\), the negative correlation between any pair of \(\theta\)’s goes to 0, and so the joint distribution approaches i.i.d.
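This can be checked by simulation (my own sketch, not from the book): a quick calculation gives the pairwise correlation under the constrained prior as \(-1/(2(2J - 1))\), which indeed shrinks towards zero as \(J\) grows.

```r
# Pairwise correlation under the "exactly J from each component" prior.
set.seed(1)
pair_cor <- function(J, nsim = 2e4) {
  draws <- replicate(nsim, {
    mu <- sample(rep(c(1, -1), each = J))  # exactly J means of each kind
    rnorm(2, mu[1:2], 1)                   # one pair (theta_1, theta_2)
  })
  cor(draws[1, ], draws[2, ])
}
sapply(c(2, 10, 50), pair_cor)  # negative, approaching zero
```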
Suppose distribution of \(\pmb{\theta} = (\theta_1, \theta_2, \ldots, \theta_J)\) can be written as a mixture of independent and identically distributed components:
\[p(\theta) = \int \prod_{j=1}^J p(\theta_j \lvert \phi) p(\phi) d \phi\]
Prove that the covariances \(Cov(\theta_i, \theta_j)\) are all non-negative.
Let \(\mu(\phi) = E(\theta_i \lvert \phi)\) be the conditional expectation of the \(i\)th \(\theta\) given the mixing parameter \(\phi\). Then:
\[Cov(\theta_i, \theta_j) = E(Cov(\theta_i, \theta_j \lvert \phi)) + Cov(E(\theta_i \lvert \phi), E( \theta_j \lvert \phi))\] \[= 0 + Cov(\mu(\phi), \mu(\phi))\] \[= Var(\mu(\phi)) \geq 0\]

## 5.6 Exchangeable models
In the divorce rate example of Section 5.2, set up a prior for the values \(y_1, y_2, \ldots y_8\) that allows for one low value (Utah) and one high value (Nevada), with independent and identically distributed distributions for the other six values. The prior distribution should be exchangeable, since it is not known which of the 8 states correspond to Utah or Nevada:
\[p(\pmb{\theta}) = \bigg[ 2{8 \choose 2} \bigg]^{-1} \sum_{\text{Permutations } (p_1, \ldots, p_8)} \text{Normal}(\theta_{p_1} \lvert \mu_1, \sigma^2_1) \cdot \text{Normal}(\theta_{p_2} \lvert \mu_2, \sigma^2_2) \cdot \prod_{i=3}^8 \text{Normal}(\theta_{p_i} \lvert \mu_3, \sigma^2_3)\]
The prior is a mixture of three Normal distributions, each with a different mean and variance. Additionally, since we don’t know which of the observations belong to Utah or Nevada, we need to average across all the possible placements of these two states within the list of all eight states.
Since I don’t know much about divorce rates in the US, I set the parameters \(\mu_1\), \(\mu_2\) and \(\mu_3\) to 1, 12, and 6.5 respectively, with a common standard deviation of 2.5.
theta_obs <- c(5.8, 6.6, 7.8, 5.6, 7.0, 7.1, 5.4)
mu <- c(1, 12, 6.5)
sigma <- c(2.5, 2.5, 2.5)
inds <- expand.grid(1:8, 1:8)
inds <- inds[inds[, 1] != inds[, 2], ]
pos <- matrix(3, nrow = 56, ncol = 8)
for (i in 1:56) {
pos[i, inds[i, 1]] <- 1
pos[i, inds[i, 2]] <- 2
}
p_theta <- dnorm(rep(theta_obs, 56), mu[t(pos[, -8])], sigma[t(pos[, -8])], log = TRUE)
p_theta <- matrix(p_theta, ncol = 7, byrow = TRUE)
p_theta <- sum(exp(rowSums(p_theta))) / 56
theta <- seq(0, 15, 0.01)
theta_mat <- matrix(rep(theta_obs, length(theta)), ncol = 7, byrow = TRUE)
theta_mat <- cbind(theta_mat, theta)
p_theta8 <- apply(theta_mat, 1, function(x) {
p <- dnorm(rep(x, 56), mu[t(pos)], sigma[t(pos)], log = TRUE)
p <- matrix(p, ncol = 8, byrow = TRUE)
sum(exp(rowSums(p))) / 56
})
p_theta8 <- p_theta8 / p_theta
p_theta8 <- p_theta8 / sum(p_theta8)
plot(theta, p_theta8, type = 'l', col = 'steelblue', lwd = 2,
xlab = expression(theta[8]), ylab = 'Density', axes = FALSE)
axis(1, tick = FALSE)
box(bty = 'L', col = 'grey60')
Now consider the eighth data point, \(y_8 = 13.9\). Are these data consistent with the prior distribution you gave in part (b)? In particular, did your prior allow for the possibility that the actual data have an outlier at the high end, but no outlier at the low end?
sum(p_theta8[theta >= 13.9])
## [1] 0.04940172
(1 - sum(p_theta8[theta < 3]))^8
## [1] 0.1422585
The prior did seem to allow for an outlier on the high end (with 4.9% probability for the observed divorce rate for Nevada), and also not observing any outlier on the low end (with probability of not observing any observation less than 3 being about 14.2%). One way to improve the prior might be to make sure that the probability density goes to zero near zero, which it currently doesn’t as a mix of normals.
If \(y \sim \text{Poisson}(\theta)\) and \(\theta \sim \text{Gamma}(\alpha, \beta)\), then the marginal (prior predictive) distribution of \(y\) is a negative binomial with parameters \(\alpha\) and \(\beta\) (or \(p = \beta/(1 + \beta)\)). Find the mean and variance of the negative binomial:
\[E(y) = E[E( y \lvert \theta)] = E(\theta) = \frac{\alpha}{\beta}\] \[Var(y) = E[Var(y \lvert \theta)] + Var[E(y \lvert \theta)] = E(\theta) + Var(\theta) = \frac{\alpha}{\beta} + \frac{\alpha}{\beta^2} = \frac{\alpha(\beta + 1)}{\beta^2}\]
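A quick Monte Carlo sanity check of these moments, using R's `rnbinom` with `size = alpha` and `prob = beta / (1 + beta)` to match the parametrization above:

```r
# The simulated mean and variance should match alpha / beta and
# alpha * (beta + 1) / beta^2.
set.seed(1)
alpha <- 3; beta <- 2
y <- rnbinom(1e6, size = alpha, prob = beta / (1 + beta))
c(mean(y), alpha / beta)                # both about 1.5
c(var(y), alpha * (beta + 1) / beta^2)  # both about 2.25
```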
In the normal model with unknown location and scale \((\mu, \sigma^2)\), the non-informative prior density, \(p(\mu, \sigma^2) \propto 1 / \sigma^2\), results in a normal-inverse-\(\chi^2\) posterior distribution for \((\mu, \sigma^2)\). Marginally then, \(\sqrt{n}(\mu - \bar y)/s\) has a \(t_{n-1}\) posterior distribution. Derive mean and variance of this distribution, stating the appropriate condition on \(n\) for existence of both moments:
\[E(\sqrt{n}(\mu - \bar y)/s \lvert y) = E \bigg[\frac{\sqrt{n}}{s} E(\mu - \bar y \lvert \sigma, y) \bigg \lvert y \bigg] = E \bigg[ \frac{\sqrt{n}}{s} \cdot 0 \bigg\lvert y \bigg] = 0\] \[Var(\sqrt{n}(\mu - \bar y)/s \lvert y) = E[Var(\sqrt{n}(\mu - \bar y)/s \lvert \sigma, y) \lvert y] + Var[E(\sqrt{n}(\mu - \bar y)/s \lvert \sigma, y) \lvert y]\] \[= E \bigg[\frac{n}{s^2} Var(\mu \lvert \sigma, y) \bigg\lvert y \bigg]\] \[= E \bigg[ \frac{n}{s^2} \cdot \frac{\sigma^2}{n} \bigg\lvert y \bigg]\] \[= E(\sigma^2 \lvert y)/s^2 = \frac{(n-1)s^2}{(n-3)}/s^2 \qquad \text{(by properties of scaled inverse-} \chi^2)\] \[= \frac{n-1}{n-3}\] Clearly, we must have \(n > 3\) for the variance to be defined.
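A quick Monte Carlo check of the variance formula (my own addition):

```r
# Draws from the t_{n-1} posterior should have variance (n - 1) / (n - 3).
set.seed(1)
n <- 10
draws <- rt(1e6, df = n - 1)
c(var(draws), (n - 1) / (n - 3))  # both about 1.29
```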
Let \(p_m(\theta)\) for \(m = 1, 2, \ldots, M\) be a set of conjugate prior densities for the sampling model \(y \lvert \theta\). Show that the class of finite mixture prior densities:
\[p(\theta) = \sum_{m=1}^M \lambda_m p_m(\theta)\]
where the \(\lambda_m\)’s are weights that sum to 1, is also a conjugate class. Use the mixture form to create a bimodal prior density for a normal mean that is thought to be near 1, with standard deviation 0.5, but also has a small probability of being near -1, with the same standard deviation. If the variance of the observations \(y_1, y_2, \ldots, y_{10}\) is known to be 1, and their observed mean is \(\bar y = -0.25\), derive the posterior distribution for the mean, making a sketch of both prior and posterior densities.
Since each component of the prior mixture density \(p(\theta) = \sum_{m=1}^M \lambda_m p_m(\theta)\) integrates to one, and the \(\lambda_m\)’s add up to one, the prior mixture density will integrate to one as well. Additionally, since each prior component has a form that is conjugate to the likelihood, the mixture prior will be conjugate as well: each component is updated to its conjugate posterior, and the weights are updated in proportion to \(\lambda_m\) times the marginal likelihood of the data under component \(m\).
mu <- seq(-3, 3, 0.01)
prior1 <- dnorm(mu, 1, 0.5)
prior2 <- dnorm(mu, -1, 0.5)
prior <- 0.9 * prior1 + 0.1 * prior2
plot(mu, prior, type = 'l', col = 'steelblue', lwd = 2,
xlab = expression(mu), ylab = 'Density', axes = FALSE)
axis(1, tick = FALSE)
box(bty = 'L', col = 'grey60')
mean1 <- ( (10 * -0.25) / 1 + 1 / 0.5^2) / (10 / 1 + 1 / 0.5^2)
var1 <- 1 / (10/1 + 1 / 0.5^2)
mean2 <- ( (10 * -0.25) / 1 + -1 / 0.5^2) / (10 / 1 + 1 / 0.5^2)
var2 <- 1 / (10/1 + 1 / 0.5^2)
post1 <- dnorm(mu, mean1, sqrt(var1))
post2 <- dnorm(mu, mean2, sqrt(var2))
marg <- dnorm(-0.25, c(1,-1), sqrt(0.5^2 + 1/10))
post_wt <- c(0.9, 0.1) * marg / sum(c(0.9, 0.1) * marg)
post <- post_wt[1] * post1 + post_wt[2] * post2
plot(mu, post, type = 'l', col = 'steelblue', lwd = 2,
xlab = expression(mu), ylab = 'Density', axes = FALSE)
axis(1, tick = FALSE)
box(bty = 'L', col = 'grey60')
# Alternative way of computing, without relying on conjugacy
mu <- seq(-3, 3, 0.01)
prior1 <- dnorm(mu, 1, 0.5)
prior2 <- dnorm(mu, -1, 0.5)
lik <- dnorm(mu, -0.25, sqrt(1/10))
unnorm_post1 <- prior1 * lik
unnorm_post2 <- prior2 * lik
post_wt <- c(sum(0.9 * unnorm_post1), sum(0.1 * unnorm_post2)) /
sum(0.9 * unnorm_post1 + 0.1 * unnorm_post2)
post <- post_wt[1] * unnorm_post1 / sum(unnorm_post1) +
post_wt[2] * unnorm_post2 / sum(unnorm_post2)
plot(mu, post, type = 'l', col = 'steelblue', lwd = 2,
xlab = expression(mu), ylab = 'Density', axes = FALSE)
axis(1, tick = FALSE)
box(bty = 'L', col = 'grey60')
Show that, with a uniform prior density on \((\log(\alpha / \beta), \log(\alpha + \beta))\), the unnormalized posterior density has an infinite integral.
Show that combining a uniform prior distribution on \((\alpha + \beta)^{-\frac{1}{2}}\) and an independent uniform prior distribution on \(\frac{\alpha}{\alpha + \beta}\) yields the prior density \(p(\alpha, \beta) \propto (\alpha + \beta)^{-\frac{5}{2}}\).
The determinant of the Jacobian is:
\[D = \begin{vmatrix} \frac{\beta}{(\alpha + \beta)^2} & -\frac{\alpha}{(\alpha + \beta)^2} \\ -\frac{1}{2}(\alpha + \beta)^{-\frac{3}{2}} & -\frac{1}{2}(\alpha + \beta)^{-\frac{3}{2}} \end{vmatrix} = -\frac{\beta}{2(\alpha + \beta)^{\frac{7}{2}}} - \frac{\alpha}{2(\alpha + \beta)^{\frac{7}{2}}} = -\frac{(\alpha + \beta)}{2(\alpha + \beta)^{\frac{7}{2}}} \propto (\alpha + \beta)^{-\frac{5}{2}}\]
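The algebra can be verified with a finite-difference Jacobian at an arbitrary test point (a sketch in base R, not from the book):

```r
# Transformation (alpha, beta) -> (alpha/(alpha+beta), (alpha+beta)^(-1/2));
# the absolute Jacobian determinant should equal (alpha + beta)^(-5/2) / 2.
f <- function(p) c(p[1] / (p[1] + p[2]), (p[1] + p[2])^(-1/2))
num_jacobian <- function(f, p, h = 1e-6) {
  cbind((f(p + c(h, 0)) - f(p - c(h, 0))) / (2 * h),   # d/d alpha
        (f(p + c(0, h)) - f(p - c(0, h))) / (2 * h))   # d/d beta
}
p <- c(2, 3)  # arbitrary (alpha, beta)
c(abs(det(num_jacobian(f, p))), 0.5 * sum(p)^(-5/2))  # should agree
```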
Show that the resulting posterior distribution is proper as long as \(0 < y_j < n_j\) for at least one experiment \(j\).
Consider the hierarchical normal model in Section 5.4.
If the hyperprior distribution is \(p(\mu, \tau) \propto \tau^{-1}\), show that the posterior density is improper.
The marginal posterior density of \(\tau\) is:
\[p(\tau \lvert y) \propto p(\tau) V^{1/2}_\mu \prod_{j=1}^J (\sigma^2_j + \tau^2)^{-1/2} \exp \bigg(-\frac{(\bar y_j - \hat \mu)^2}{2(\sigma^2_j + \tau^2)} \bigg)\] where \(V_\mu^{-1} = \sum_{j=1}^J \frac{1}{\sigma_j^2 + \tau^2}\)
As \(\tau \to 0\), everything apart from \(p(\tau)\) converges to a positive constant. However, if \(p(\tau) \propto \tau^{-1}\), then the integrand behaves like \(\tau^{-1}\) near zero, and \(\int_0^\epsilon \tau^{-1} d\tau\) diverges, so the resulting posterior is improper.
If the hyperprior distribution is \(p(\mu, \tau) \propto 1\), show that the posterior density is proper if \(J > 2\).
The exponential term of the posterior is bounded above by one, so we can bound the rest as:
\[p(\tau \lvert y)^* \propto p(\tau) V^{1/2}_\mu \prod_{j=1}^J (\sigma^2_j + \tau^2)^{-1/2}\] \[\propto 1 \cdot \frac{1}{\sqrt{\sum_{j=1}^J \frac{1}{\sigma_j^2 + \tau^2}}} \cdot \prod_{j=1}^J (\sigma_j^2 + \tau^2)^{-1/2}\] As \(\tau \to \infty\), the contribution of the observation variances \(\sigma_j^2\) becomes negligible, so we can rewrite this as:
\[\propto \bigg(\sum_{j = 1}^J \frac{1}{\tau^2} \bigg)^{-1/2} \cdot (\tau^2)^{-J/2}\] \[= \bigg(\frac{J}{\tau^2} \bigg)^{-1/2} \cdot (\tau^2)^{-J/2}\] \[= \frac{\tau}{\sqrt{J}} \cdot \frac{1}{\tau^{J}} = \frac{1}{\sqrt{J} \tau^{J - 1}}\]
In the limit as \(\tau \to \infty\), if \(J = 1\), the expression above is constant in \(\tau\), and if \(J = 2\) it decays like \(1/\tau\); in both cases the integral over \(\tau\) diverges. For \(J > 2\), the integrand decays like \(\tau^{-(J - 1)}\) with \(J - 1 \geq 2\), so the tail integral is finite and the posterior is proper.
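The tail behaviour can be checked numerically with the eight-schools standard errors (\(J = 8\)), where the unnormalized density should decay roughly like \(\tau^{-(J-1)} = \tau^{-7}\):

```r
# For tau much larger than every sigma_j, doubling tau should multiply the
# unnormalized marginal density by about 2^(-(J - 1)).
sigma <- c(15, 10, 16, 11, 9, 11, 10, 18)  # eight-schools standard errors
p_star <- function(tau) {
  sum(1 / (sigma^2 + tau^2))^(-1/2) * prod((sigma^2 + tau^2)^(-1/2))
}
log2(p_star(2000) / p_star(1000))  # about -(J - 1) = -7
```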
How would you analyze SAT coaching data if \(J = 2\) (that is, if there were only 2 schools)?
If there are only two schools, it doesn’t seem reasonable to analyze the data as a hierarchical model anymore. The easiest thing to do would be just to do comparison of normal means with known variances.
Suppose in the rat tumor example, we want to use a normal population distribution on the log odds scale: \(\text{logit}{(\theta_j)} \sim \text{Normal}(\mu, \tau^2)\) for \(j = 1, 2, \ldots, J\).
Write the joint posterior density \(p(\pmb{\theta}, \mu, \tau \lvert y)\):
\[p(\pmb{\theta}, \mu, \tau \lvert y) \propto p(\mu, \tau) \cdot p(\theta \lvert \mu, \tau) \cdot p(y \lvert \theta)\] Now, \(p(\theta \lvert \mu, \tau)\) is parametrized in logit units, so we need to find the Jacobian of the transformation:
\[\eta = \log \bigg( \frac{\theta}{1 - \theta} \bigg)\] \[\frac{\partial \eta}{\partial \theta} = \frac{1 - \theta}{\theta} \cdot \frac{(1 - \theta) - (- \theta)}{(1 - \theta)^2} = \frac{1 - \theta}{\theta} \cdot \frac{1}{(1 - \theta)^2} = \frac{1}{\theta(1 - \theta)}\]
Thus:
\[p(\pmb{\theta}, \mu, \tau \lvert y) \propto p(\mu, \tau) \cdot \prod_{j=1}^J \bigg[ \frac{1}{\theta_j (1 - \theta_j)} \cdot \frac{1}{\tau} \cdot \exp \bigg( - \frac{(\text{logit}(\theta_j) - \mu)^2}{2 \tau^2} \bigg) \bigg] \cdot \prod_{j=1}^J \theta_j^{y_j} (1 - \theta_j)^{n_j - y_j}\]

### (b)
Show that the integral \(p(\phi \lvert y) = \int p(\theta, \phi \lvert y) d\theta\) (i.e. the marginal posterior of the hyperparameters) has no closed-form expression.
The integral factors into \(J\) one-dimensional integrals over the individual \(\theta_j\)'s, but each factor mixes a normal density on the logit scale with a binomial likelihood on the probability scale, and that product has no closed-form integral: the logit-normal population distribution is not conjugate to the binomial likelihood.
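Each factor can still be evaluated numerically, though. A sketch for a single \(\theta_j\), with hypothetical values of \(\mu\), \(\tau\), \(y_j\), and \(n_j\):

```r
# Numerically integrate one factor: the binomial likelihood against the
# logit-normal population density (normal on logit(theta), times the Jacobian)
integrand <- function(theta, mu, tau, y, n) {
  dbinom(y, n, theta) *
    dnorm(log(theta / (1 - theta)), mu, tau) / (theta * (1 - theta))
}
# hypothetical values: mu = -1.8, tau = 0.5, y = 2, n = 20
integrate(integrand, 0, 1, mu = -1.8, tau = 0.5, y = 2, n = 20)$value
```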
Why is the expression \(p(\phi \lvert y) = \frac{p(\theta, \phi \lvert y)}{p(\theta \lvert \phi, y)}\) no help in this problem?
In order for that expression to be useful, we would have to know \(p(\theta \lvert \mu, \tau, y)\) in closed form, which we don't, since the population prior and the likelihood aren't conjugate.
Derive analytic expressions for \(E(\theta_j \lvert \tau, y)\) and \(\text{Var}(\theta_j \lvert \tau, y)\).
\[E(\theta_j \lvert \tau, y) = E\big(E(\theta_j \lvert \mu, \tau, y) \,\big\lvert\, \tau, y\big) = E \Bigg[ \frac{\frac{y_j}{\sigma_j^2} + \frac{\mu}{\tau^2}}{\frac{1}{\sigma_j^2} + \frac{1}{\tau^2}} \Bigg \lvert \tau, y \Bigg] = \frac{\frac{y_j}{\sigma_j^2} + \frac{\hat \mu}{\tau^2}}{\frac{1}{\sigma_j^2} + \frac{1}{\tau^2}} \] \[Var(\theta_j \lvert \tau, y) = E\big(Var(\theta_j \lvert \mu, \tau, y) \,\big\lvert\, \tau, y\big) + Var\big(E(\theta_j \lvert \mu, \tau, y) \,\big\lvert\, \tau, y\big)\] \[= E \Bigg[ \frac{1}{\frac{1}{\sigma_j^2} + \frac{1}{\tau^2}} \Bigg \lvert \tau, y \Bigg] + Var \Bigg[ \frac{\frac{y_j}{\sigma_j^2} + \frac{\mu}{\tau^2}}{\frac{1}{\sigma_j^2} + \frac{1}{\tau^2}} \Bigg \lvert \tau, y \Bigg]\] \[= \frac{1}{\frac{1}{\sigma_j^2} + \frac{1}{\tau^2}} + \Bigg( \frac{\frac{1}{\tau^2}}{\frac{1}{\sigma_j^2} + \frac{1}{\tau^2}} \Bigg)^2 \cdot V_\mu\]
where:
\[\hat \mu = \frac{\sum_{j = 1}^J \frac{1}{\sigma_j^2 + \tau^2} \bar y_{\cdot j}}{\sum_{j=1}^J \frac{1}{\sigma_j^2 + \tau^2}} \qquad V_\mu^{-1} = \sum_{j=1}^J \frac{1}{\sigma^2_j + \tau^2}\]
The bicycle lane data from chapter 3:
Set up a model for the data such that, for \(j = 1, 2, \ldots, 10\), the observed number of bicycles at location \(j\) is binomial with unknown probability \(\theta_j\) and sample size equal to the total number of vehicles. Assign a Beta population distribution for the parameters \(\theta_j\) and a non-informative hyperprior distribution. Write down the joint posterior distribution.
\[p(\log(\alpha / \beta), \log(\alpha + \beta)) \propto \alpha \beta (\alpha + \beta)^{-5/2}\] \[p(\theta, \alpha, \beta \lvert y) = p(\alpha, \beta) p(\theta \lvert \alpha, \beta) p(y \lvert \theta)\] \[ \propto p(\alpha, \beta) \prod_{j=1}^J \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} \theta_j^{\alpha - 1} (1 - \theta_j)^{\beta - 1} \prod_{j=1}^J \theta_j^{y_j} (1 - \theta_j)^{n_j - y_j}\] \[p(\theta \lvert \alpha, \beta, y) = \prod_{j=1}^J \frac{\Gamma(\alpha + \beta + n_j)}{\Gamma(\alpha + y_j) \Gamma(\beta + n_j - y_j)} \theta_j^{\alpha + y_j - 1} (1 - \theta_j)^{\beta + n_j - y_j - 1}\] \[p(\alpha, \beta \lvert y) \propto p(\alpha, \beta) \prod_{j=1}^J \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha) \Gamma(\beta)} \cdot \frac{\Gamma(\alpha + y_j) \Gamma(\beta + n_j - y_j)}{\Gamma(\alpha + \beta + n_j)}\]
### (b)
Compute the marginal posterior density of the hyperparameters and draw simulations from the joint posterior distribution of the parameters and hyperparameters:
w_lanes_bic <- c(16, 9, 10, 13, 19, 20, 18, 17, 35, 55)
w_lanes_all <- c(58, 90, 48, 57, 103, 57, 86, 112, 273, 64)
y <- w_lanes_bic
n <- w_lanes_bic + w_lanes_all
s <- seq(-2.2, 0, length = 100)
t <- seq(0, 4.5, length = 100)
to_beta <- function(x, y) exp(y) / (exp(x) + 1)
to_alpha <- function(x, y) exp(x) * to_beta(x, y)
a <- outer(s, t, to_alpha)
b <- outer(s, t, to_beta)
log_prior <- -5/2 * log(a + b)
log_jacob <- log(a) + log(b)
log_lik <- matrix(nrow = length(s), ncol = length(t))
for (i in seq_along(s)) {
for (j in seq_along(t)) {
aa <- a[i, j]; bb <- b[i, j]
log_lik[i, j] <- sum(lgamma(aa + bb) - lgamma(aa) - lgamma(bb) +
lgamma(aa + y) + lgamma(bb + n - y) -
lgamma(aa + bb + n))
}
}
log_post <- log_prior + log_jacob + log_lik
log_post <- log_post - max(log_post)
post <- exp(log_post)
post <- post / sum(post)
contour(s, t, post, col = 'steelblue',
drawlabels = FALSE, xlab = expression(log(alpha / beta)),
ylab = expression(log(alpha + beta)), axes = FALSE)
axis(1, tick = FALSE)
axis(2, tick = FALSE, las = 1)
box(bty = 'L', col = 'grey60')
post_s <- rowSums(post)
post_s <- post_s / sum(post_s)
s_inds <- sample(seq_along(s), 1e3, prob = post_s, replace = TRUE)
s_samples <- s[s_inds]
t_samples <- numeric(1e3)
for (i in seq_along(s_samples)) {
cond <- post[s_inds[i], ]
cond <- cond / sum(cond)
t_samples[i] <- sample(t, 1, prob = cond)
}
plot(s_samples, t_samples, pch = 19, col = colt('seagreen', 0.1),
axes = FALSE, xlab = expression(log(alpha / beta)[samples]),
ylab = expression(log(alpha + beta)[samples]),
xlim = c(-2.2, 0), ylim = c(0, 4.5))
contour(s, t, post, col = 'steelblue',
drawlabels = FALSE, add = TRUE)
axis(1, tick = FALSE)
axis(2, tick = FALSE, las = 1)
box(bty = 'L', col = 'grey60')
a_samples <- to_alpha(s_samples, t_samples)
b_samples <- to_beta(s_samples, t_samples)
theta_samples <- matrix(0, ncol = length(y), nrow = 1e3)
par(mfrow = c(5, 2), mar = c(2.5, 2.5, 1.5, 1.5))
for (j in seq_along(y)) {
theta_samples[, j] <- rbeta(1e3, a_samples + y[j], b_samples + n[j] - y[j])
hist(theta_samples[, j], axes = FALSE, main = bquote(theta[.(j)]),
col = 'steelblue', border = 'white', xlim = c(0, 0.6), breaks = seq(0, 1, 0.025),
xlab = expression(tilde(theta)), ylab = 'Count')
points(y[j]/n[j], 0, col = 'firebrick', pch = 19)
axis(1, tick = FALSE)
axis(2, tick = FALSE, las = 1)
box(bty = 'L', col = 'grey60')
}
Compare the posterior distribution of the parameters \(\theta_j\) to the raw proportions \(y_j / n_j\). How does the inference from the posterior distribution differ from the raw proportions?
theta_quants <- apply(theta_samples, 2, quantile, c(0.025, 0.975))
yhat <- y/n
par(mar = c(5, 5, 2, 2), mfrow = c(1, 1))
plot(seq(0, 0.6, 0.1), seq(0, 0.6, 0.1), type = 'l', col = 'grey60', axes = FALSE,
xlab = expression(hat(theta) == y/n), ylab = expression(hat(theta)[hierarchical]))
segments(yhat, theta_quants[1, ], yhat, theta_quants[2, ], col = 'steelblue')
points(yhat, colMeans(theta_samples), pch = 19, col = 'steelblue')
axis(1, tick = FALSE)
axis(2, tick = FALSE, las = 1)
box(bty = 'L', col = 'grey60')
# Try in Stan
# NOT RUN
library(cmdstanr)
mod_string1 <- '
data {
int<lower=0> N;
array[N] int<lower=0> y;
array[N] int<lower=0> n;
}
parameters {
real<lower=0> alpha;
real<lower=0> beta;
vector<lower=0, upper=1>[N] theta;
}
transformed parameters {
real s;
real t;
s = log(alpha / beta);
t = log(alpha + beta);
}
model {
target += - 2.5 * log(alpha + beta);
theta ~ beta(alpha, beta);
y ~ binomial(n, theta);
}
'
dt1 <- list(N = length(y), y = y, n = n)
mod1 <- cmdstan_model(write_stan_file(mod_string1))
fit1 <- mod1$sample(dt1, adapt_delta = 0.9, iter_sampling = 500, refresh = 0)
st_samples <- fit1$draws(variables = c('s', 't'), format = 'draws_matrix')
plot(st_samples[, 1], st_samples[, 2], pch = 19, col = colt('seagreen', 0.1),
axes = FALSE, xlab = expression(log(alpha / beta)[samples]),
ylab = expression(log(alpha + beta)[samples]),
xlim = c(-2.2, 0), ylim = c(0, 4.5))
contour(s, t, post, col = 'steelblue',
drawlabels = FALSE, add = TRUE)
axis(1, tick = FALSE)
axis(2, tick = FALSE, las = 1)
box(bty = 'L', col = 'grey60')
Give a 95% posterior interval for the average underlying proportion of traffic that is bicycles.
theta_marg_samples <- rbeta(1e3, a_samples, b_samples)
hist(theta_marg_samples, axes = FALSE, main = NULL,
col = 'steelblue', border = 'white', xlim = c(0, 0.6), breaks = seq(0, 1, 0.05),
xlab = expression(theta), ylab = 'Count')
axis(1, tick = FALSE)
axis(2, tick = FALSE, las = 1)
box(bty = 'L', col = 'grey60')
quantile(theta_marg_samples, c(0.025, 0.975))
## 2.5% 97.5%
## 0.03569438 0.47130638
A new city block is sampled at random and is a residential street with a bike route. In an hour of observation, 100 vehicles of all kinds go by. Give a 95% posterior interval for the number of those vehicles that are bicycles.
yrep_samples <- rbinom(1e3, 100, theta_marg_samples)
hist(yrep_samples, axes = FALSE, main = NULL,
col = 'steelblue', border = 'white',
xlab = expression(tilde(y)), ylab = 'Count')
axis(1, tick = FALSE)
axis(2, tick = FALSE, las = 1)
box(bty = 'L', col = 'grey60')
(yrep_quants <- quantile(yrep_samples, c(0.025, 0.5, 0.975)))
## 2.5% 50% 97.5%
## 3.000 20.000 48.025
We would expect anywhere between 3 and 48 bicycles (with a median of 20). Comparing this with the other values observed in the sample, that seems like a fairly reasonable (although conservative) estimate.
Was the Beta distribution for the \(\theta_j\)’s reasonable?
If our aim was to estimate the proportion of traffic that the bicycles made up, then a Beta distribution would seem like a reasonable prior for the \(\theta_j\)'s.
Consider the same data as in the previous question, but suppose only the total amount of traffic at each location is observed.
Set up a model in which the total number of vehicles observed at each location \(j\) follows a Poisson distribution with parameter \(\theta_j\), the “true” rate of traffic per hour at that location. Assign a Gamma population distribution for \(\theta_j\)’s and a non-informative hyperprior distribution. Write down the joint posterior distribution.
I’ll try using the same non-informative density as in the previous question.
\[p(\alpha, \beta) \propto (\alpha + \beta)^{-\frac{5}{2}}\] \[p(\alpha, \beta, \theta \lvert y) = p(\alpha, \beta) \cdot p(\theta \lvert \alpha, \beta) \cdot p(y \lvert \theta)\] \[= p(\alpha, \beta) \cdot \prod_{j= 1}^J \frac{\beta^\alpha}{\Gamma(\alpha)} \theta_j^{\alpha - 1} e^{- \beta \theta_j} \cdot \prod_{j=1}^J \frac{\theta_j^{y_j} e^{-\theta_j}}{y_j!}\] \[= p(\alpha, \beta) \cdot \prod_{j = 1}^J \frac{\beta^\alpha}{\Gamma(\alpha)} \cdot \frac{1}{y_j!} \cdot \theta_j^{\alpha + y_j - 1} e^{-(\beta + 1) \theta_j}\]
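Integrating each \(\theta_j\) out of the last product gives the marginal likelihood of \(y_j\), which is negative binomial:
\[p(y_j \lvert \alpha, \beta) = \frac{\Gamma(\alpha + y_j)}{\Gamma(\alpha) \, y_j!} \bigg( \frac{\beta}{\beta + 1} \bigg)^{\alpha} \bigg( \frac{1}{\beta + 1} \bigg)^{y_j}\]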
\[p(\theta \lvert \alpha, \beta, y) = \prod_{j=1}^J \frac{(\beta + 1)^{\alpha + y_j}}{\Gamma(\alpha + y_j)} \theta_j^{\alpha + y_j - 1} e^{-(\beta + 1) \theta_j}\] \[p(\alpha, \beta \lvert y) = \frac{p(\alpha, \beta, \theta \lvert y)}{p(\theta \lvert \alpha, \beta, y)}\] \[= p(\alpha, \beta) \cdot \prod_{j = 1}^J \frac{\beta^\alpha}{\Gamma(\alpha)} \cdot \frac{1}{y_j!} \cdot \frac{\Gamma(\alpha + y_j)}{(\beta + 1)^{\alpha + y_j}}\]
### (b)
Compute the marginal posterior density of the hyperparameters and plot its contours. Simulate random draws from the posterior distribution of the hyperparameters and make a scatterplot of the simulation draws.
Here, I can just recycle the code from the previous question:
s <- seq(3.5, 6, length = 100)
t <- seq(0, 3.5, length = 100)
a <- outer(s, t, to_alpha)
b <- outer(s, t, to_beta)
log_prior <- -5/2 * log(a + b) # recompute on the new grid
log_jacob <- log(a) + log(b)   # (otherwise stale values from the old grid get reused)
log_lik <- matrix(nrow = length(s), ncol = length(t))
for (i in seq_along(s)) {
for (j in seq_along(t)) {
aa <- a[i, j]; bb <- b[i, j]
log_lik[i, j] <- sum(aa * log(bb) - lgamma(aa) - lfactorial(n) +
lgamma(aa + n) - (aa + n) * log(bb + 1))
}
}
log_post <- log_prior + log_jacob + log_lik
log_post <- log_post - max(log_post)
post <- exp(log_post)
post <- post / sum(post)
contour(s, t, post, col = 'steelblue',
drawlabels = FALSE, xlab = expression(log(alpha / beta)),
ylab = expression(log(alpha + beta)), axes = FALSE)
axis(1, tick = FALSE)
axis(2, tick = FALSE, las = 1)
box(bty = 'L', col = 'grey60')
post_s <- rowSums(post)
post_s <- post_s / sum(post_s)
s_inds <- sample(seq_along(s), 1e3, prob = post_s, replace = TRUE)
s_samples <- s[s_inds]
t_samples <- numeric(1e3)
for (i in seq_along(s_samples)) {
cond <- post[s_inds[i], ]
cond <- cond / sum(cond)
t_samples[i] <- sample(t, 1, prob = cond)
}
plot(s_samples, t_samples, pch = 19, col = colt('seagreen', 0.1),
axes = FALSE, xlab = expression(log(alpha / beta)[samples]),
ylab = expression(log(alpha + beta)[samples]),
xlim = c(3.5, 6), ylim = c(0, 3.5))
contour(s, t, post, col = 'steelblue',
drawlabels = FALSE, add = TRUE)
axis(1, tick = FALSE)
axis(2, tick = FALSE, las = 1)
box(bty = 'L', col = 'grey60')
Is the posterior density integrable? Answer analytically by examining the joint posterior density at the limits or empirically by examining the plots of the marginal posterior density above.
Looking at the plots, the posterior mode is concentrated around the point (4.75, 1.5), and the density does not seem to stretch towards either zero or infinity, so it should be integrable.
Draw samples from the joint posterior distribution of the parameters and hyperparameters, by analogy to the method used in the hierarchical binomial model.
Again, I can recycle previous code:
a_samples <- to_alpha(s_samples, t_samples)
b_samples <- to_beta(s_samples, t_samples)
theta_samples <- matrix(0, ncol = length(y), nrow = 1e3)
par(mfrow = c(5, 2), mar = c(2.5, 2.5, 1.5, 1.5))
for (j in seq_along(y)) {
theta_samples[, j] <- rgamma(1e3, a_samples + n[j], b_samples + 1)
hist(theta_samples[, j], axes = FALSE, main = bquote(theta[.(j)]),
col = 'steelblue', border = 'white',
xlab = expression(tilde(theta)), ylab = 'Count')
points(n[j], 0, col = 'firebrick', pch = 19)
axis(1, tick = FALSE)
axis(2, tick = FALSE, las = 1)
box(bty = 'L', col = 'grey60')
}
theta_quants <- apply(theta_samples, 2, quantile, c(0.025, 0.975))
nhat <- n
par(mar = c(5, 5, 2, 2), mfrow = c(1, 1))
plot(seq(50, 320, 0.1), seq(50, 320, 0.1), type = 'l', col = 'grey60', axes = FALSE,
xlab = expression(hat(theta) == n), ylab = expression(hat(theta)[hierarchical]))
segments(nhat, theta_quants[1, ], nhat, theta_quants[2, ], col = 'steelblue')
points(nhat, colMeans(theta_samples), pch = 19, col = 'steelblue')
axis(1, tick = FALSE)
axis(2, tick = FALSE, las = 1)
box(bty = 'L', col = 'grey60')
Perform the computations for the meta-analysis of data in Table 5.4.
Plot the posterior density of \(\tau\) over an appropriate range that includes essentially all of the posterior density.
y_control <- c(3, 14, 11, 127, 27, 6, 152, 48, 37, 188,
52, 47, 16, 45, 31, 38, 12, 6, 3, 40, 43, 39)
n_control <- c(39, 116, 93, 1520, 365, 52, 939, 471, 282,
1921, 583, 266, 293, 883, 147, 213, 122,
154, 134, 218, 364, 674)
y_treat <- c(3, 7, 5, 102, 28, 4, 98, 60, 25, 138, 64,
45, 9, 57, 25, 33, 28, 8, 6, 32, 27, 22)
n_treat <- c(38, 114, 69, 1533, 355, 59, 945, 632, 278,
1916, 873, 263, 291, 858, 154, 207, 251,
151, 174, 209, 391, 680)
y_logit <- log(y_treat / (n_treat - y_treat)) -
log(y_control / (n_control - y_control))
sig2_logit <- 1 / y_treat + 1 / (n_treat - y_treat) +
1 / y_control + 1 / (n_control - y_control)
muhat <- function(tau) sum(y_logit / (sig2_logit + tau^2)) / sum(1 / (sig2_logit + tau^2))
vmu <- function(tau) 1 / sum(1 / (sig2_logit + tau^2))
tau <- seq(0.001, 0.6, length = 200)
lp_tau <- function(tau) {
1/2 * log(vmu(tau)) + sum(-1/2 * log(sig2_logit + tau^2) -
(y_logit - muhat(tau))^2 / (2 * (sig2_logit + tau^2)))
}
log_post_tau <- sapply(tau, lp_tau)
log_post_tau <- log_post_tau - max(log_post_tau)
post_tau <- exp(log_post_tau)
post_tau <- post_tau / sum(post_tau)
plot(tau, exp(log_post_tau), type = 'l', col = 'steelblue', lwd = 2,
xlab = expression(tau), ylab = 'Density', axes = FALSE)
axis(1, tick = FALSE)
axis(2, tick = FALSE, las = 1)
box(bty = 'L', col = 'grey60')
Produce graphs analogous to Figures 5.6 and 5.7 to display how the posterior means and standard deviations of the \(\theta_j\)'s depend on \(\tau\).
theta_hat <- function(mu, tau, j) (y_logit[j] / sig2_logit[j] + mu / tau^2) /
(1 / sig2_logit[j] + 1 / tau^2)
theta_var <- function(tau, j) 1 / (1 / sig2_logit[j] + 1 / tau^2)
theta_mean <- matrix(0, ncol = length(y_logit), nrow = length(tau))
theta_sd <- theta_mean
mu <- sapply(tau, muhat)
for (j in seq_along(y_logit)) {
theta_mean[, j] <- theta_hat(mu, tau, j)
theta_sd[, j] <- sqrt(theta_var(tau, j))
}
plot(seq(0, 0.6, length = 100), seq(-0.6, 0.25, length = 100), type = 'n',
xlab = expression(tau), ylab = expression(hat(theta[j])), axes = FALSE)
for (j in seq_along(y_logit)) lines(tau, theta_mean[, j], col = 'steelblue')
axis(1, tick = FALSE)
axis(2, tick = FALSE, las = 1)
box(bty = 'L', col = 'grey60')
plot(seq(0, 0.6, length = 100), seq(0, 0.6, length = 100), type = 'n',
xlab = expression(tau), ylab = expression(sd(theta[j])), axes = FALSE)
for (j in seq_along(y_logit)) lines(tau, theta_sd[, j], col = 'steelblue')
axis(1, tick = FALSE)
axis(2, tick = FALSE, las = 1)
box(bty = 'L', col = 'grey60')
Produce a scatterplot of the crude effect estimates vs. the posterior median effect estimates of the 22 studies. Verify that the studies with the smallest sample sizes are partially pooled the most towards the mean.
set.seed(123456)
tau_samples <- sample(tau, 1e3, prob = post_tau, replace = TRUE)
mu_samples <- rnorm(1e3, sapply(tau_samples, muhat), sqrt(sapply(tau_samples, vmu)))
theta_samples <- matrix(0, ncol = length(y_logit), nrow = 1e3)
for (j in seq_along(y_logit)) {
theta_samples[, j] <- rnorm(1e3, theta_hat(mu_samples, tau_samples, j),
sqrt(theta_var(tau_samples, j)))
}
theta_quants <- apply(theta_samples, 2, quantile, c(0.025, 0.975))
plot(seq(-0.75, 0.25, 0.001), seq(-0.75, 0.25, 0.001), type = 'l',
col = 'grey60', axes = FALSE,
xlab = expression(logit(y)), ylab = expression(hat(theta)[hierarchical]))
abline(h = median(mu_samples), lty = 'dashed', col = 'grey80')
segments(y_logit, theta_quants[1, ], y_logit, theta_quants[2, ], col = 'steelblue')
points(y_logit, apply(theta_samples, 2, median), pch = 19,
cex = sqrt(n_treat + n_control)/50, col = 'firebrick')
text(y_logit, theta_quants[2, ] + 0.05, labels = n_treat + n_control)
axis(1, tick = FALSE)
axis(2, tick = FALSE, las = 1)
box(bty = 'L', col = 'grey60')
The smaller studies (small points) are typically closer to the dashed grey line, which is the population median.
Draw simulations from the posterior distribution of a new treatment effect \(\tilde \theta_j\). Plot a histogram of the simulation draws.
theta_tilde_samples <- rnorm(1e4, mu_samples, tau_samples)
hist(theta_tilde_samples, axes = FALSE, main = NULL, breaks = 20,
col = 'steelblue', border = 'white',
xlab = expression(tilde(theta)), ylab = 'Count')
axis(1, tick = FALSE)
axis(2, tick = FALSE, las = 1)
box(bty = 'L', col = 'grey60')
mean(theta_tilde_samples > 0)
## [1] 0.0617
There is about a 6% probability that the treatment is the same as or worse than the placebo.
Suppose we wish to apply the inferences from the meta-analysis example in Section 5.6 to data on a new study with equal numbers of people in the control and treatment groups. How large would the study have to be so that the prior and data were weighted equally in the posterior inference for that study?
Using the posterior expectation formula:
\[E(\theta_j \lvert \mu, \tau, y) = \frac{\frac{\bar y_j}{\sigma^2_j} + \frac{\mu}{\tau^2}}{\frac{1}{\sigma_j^2} + \frac{1}{\tau^2}}\] Clearly, we want \(\frac{1}{\sigma_j^2} = \frac{1}{\tau^2}\).
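With equal precisions, the two weights are equal and the posterior mean reduces to the midpoint of the data estimate and the population mean:
\[\frac{1}{\sigma_j^2} = \frac{1}{\tau^2} \implies E(\theta_j \lvert \mu, \tau, y) = \frac{\bar y_j + \mu}{2}\]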
Our posterior estimate for \(\frac{1}{\tau^2}\) is:
(tau_inv <- mean(1 / tau_samples))
## [1] 24.77084
Therefore, we want \(\frac{1}{\sigma_j^2} =\) 24.771. Hence, using the information that the treatment and the control group should be the same size, we have:
\[\frac{1}{y_1} + \frac{1}{n_1 - y_1} + \frac{1}{y_0} + \frac{1}{n_0 - y_0} = \frac{1}{24.77}\] \[\implies \frac{1}{y_1} + \frac{1}{n - y_1} + \frac{1}{y_0} + \frac{1}{n - y_0} = \frac{1}{24.77}\] I couldn't figure out how to convert the posterior for the log-odds ratio into something that would give me \(y_1\) and \(y_0\) (or \(p_1\) and \(p_0\)), so I instead went with a slightly hacky solution: first estimate \(p_1\) and \(p_0\) using two independent hierarchical models, as in exercise 5.13, and then compute the variance of simulated posterior predictive draws of \(y_1\) and \(y_0\) given \(n\):
a <- seq(0.01, 15, length = 100)
b <- seq(10, 150, length = 100)
y1 <- y_treat
y0 <- y_control
n1 <- n_treat
n0 <- n_control
log_post1 <- matrix(0, nrow = length(a), ncol = length(b))
log_post0 <- log_post1
for (i in seq_along(a)) {
for (j in seq_along(b)) {
aa <- a[i]; bb <- b[j]
log_post1[i, j] <- sum(lgamma(aa + bb) - lgamma(aa) - lgamma(bb) +
lgamma(aa + y1) + lgamma(bb + n1 - y1) -
lgamma(aa + bb + n1))
log_post0[i, j] <- sum(lgamma(aa + bb) - lgamma(aa) - lgamma(bb) +
lgamma(aa + y0) + lgamma(bb + n0 - y0) -
lgamma(aa + bb + n0))
}
}
post1 <- exp(log_post1 - max(log_post1))
post0 <- exp(log_post0 - max(log_post0))
post_a1 <- rowSums(post1)
post_a1 <- post_a1 / sum(post_a1)
post_a0 <- rowSums(post0)
post_a0 <- post_a0 / sum(post_a0)
a1_inds <- sample(seq_along(a), 1e3, prob = post_a1, replace = TRUE)
a0_inds <- sample(seq_along(a), 1e3, prob = post_a0, replace = TRUE)
a1_samples <- a[a1_inds]
a0_samples <- a[a0_inds]
b1_samples <- numeric(1e3)
b0_samples <- numeric(1e3)
for (i in seq_along(a1_samples)) {
cond1 <- post1[a1_inds[i], ]
cond0 <- post0[a0_inds[i], ]
cond1 <- cond1 / sum(cond1)
cond0 <- cond0 / sum(cond0)
b1_samples[i] <- sample(b, 1, prob = cond1)
b0_samples[i] <- sample(b, 1, prob = cond0)
}
p1_samples <- rbeta(1e3, a1_samples, b1_samples)
p0_samples <- rbeta(1e3, a0_samples, b0_samples)
n <- seq(400, 800, 10)
sig2_samples <- matrix(0, ncol = length(n), nrow = 1e3)
for (j in seq_along(n)) {
y1 <- rbinom(1e3, n[j], p1_samples)
y0 <- rbinom(1e3, n[j], p0_samples)
sig2_samples[, j] <- pmin(1 / y1, 1) + 1 / (n[j] - y1) + pmin(1 / y0, 1) + 1 / (n[j] - y0) # pmin guards against y = 0
}
plot(n, colMeans(sig2_samples), type = 'l', col = 'steelblue', ylab = expression(hat(sigma^2)),
axes = FALSE)
abline(h = 1 / 24.77, col = 'firebrick', lty = 'dashed')
axis(1, tick = FALSE)
axis(2, tick = FALSE, las = 1)
box(bty = 'L', col = 'grey60')
(n_final <- n[min(which(colMeans(sig2_samples - 1/24.77) < 0))])
## [1] 690
Thus, we should recruit about 690 participants in each group.
Consider a hypothetical study of a simple training program for basketball free-throw shooting. A random sample of 100 college students is recruited into the study. Each student first takes 100 free throws to establish a baseline success probability. Each student then takes 50 practice shots each day for a month. At the end of that time, he or she takes 100 shots for a final measurement.
Let \(\theta_i\) be the improvement in success probability for person \(i\). For simplicity, assume the \(\theta_i\)'s are normally distributed with mean \(\mu\) and standard deviation \(\sigma\). Give three joint priors for \(\mu, \sigma\):
A non-informative prior distribution.
For this, we could use a distribution that is uniform on \(\log(\sigma)\), i.e. \(p(\mu, \sigma) \propto \frac{1}{\sigma}\).
A subjective prior distribution based on your best knowledge.
I don’t know much about basketball, but let’s say that a simple, realistic-ish hyperprior distribution could be bivariate normal centered on \(\begin{bmatrix} \mu \\ \sigma \end{bmatrix} = \begin{bmatrix} 20 \\ 20 \end{bmatrix}\). Additionally, I would imagine that the improvements of people who improve a lot would be more similar to each other than the improvements of people who improve a little or even get worse, and so the covariance matrix should have negative off-diagonals.
I experimented with a few different versions of the covariance matrix and came up with one I like: \(\Sigma = \begin{bmatrix} 200 & -25 \\ -25 & 25 \end{bmatrix}\)
set.seed(123456)
X <- MASS::mvrnorm(1e3, mu = c(20, 20), Sigma = matrix(c(200, -25, -25, 25), ncol = 2))
plot(X[, 1], X[, 2], col = 'steelblue', pch = 19,
xlab = expression(mu), ylab = expression(sigma), axes = FALSE)
axis(1, tick = FALSE)
axis(2, tick = FALSE, las = 1)
box(bty = 'L', col = 'grey60')
A weakly informative prior.
A weakly informative prior might consist of two independent normal priors on \(\mu\) and \(\sigma\), both with mean 20, and with standard deviations 100 and 10, respectively.
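A quick sketch of draws from this weakly informative prior (truncating \(\sigma\) at zero to keep it positive is my own additional assumption):

```r
set.seed(123456)
mu_prior <- rnorm(1e3, mean = 20, sd = 100)   # weakly informative on mu
sigma_prior <- rnorm(1e3, mean = 20, sd = 10) # weakly informative on sigma
sigma_prior <- sigma_prior[sigma_prior > 0]   # discard negative sigma draws
c(mean(mu_prior), mean(sigma_prior))
```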
Using the data from the SAT coaching experiments, conduct a posterior predictive check for the model which assumed identical effects in all eight schools. Specifically, compare to the expected order statistics \((26, 19, 14, 10, 6, 2, -3, -9)\). Does the model fit these order statistics?
We have the following posterior for \(\theta\) and likelihood:
\[\theta \lvert y \sim \text{Normal}(7.7, 4.1^2)\] \[y \lvert \theta \sim \text{Normal}(\theta, \sigma_j^2)\]
y <- c(28, 8, -3, 7, -1, 1, 18, 12)
y <- sort(y)
sigma <- c(15, 10, 16, 11, 9, 11, 10, 18)
theta <- rnorm(1000, 7.7, 4.1)
yrep <- sapply(theta, function(x) rnorm(8, x, sd = sigma))
yrep <- apply(yrep, 2, sort)
rowMeans(yrep)
## [1] -10.292253 -2.380525 2.366894 6.247256 9.802652 13.623297 18.412058
## [8] 27.322192
The order statistic means from the posterior predictive check are very similar to the expected order statistics. Thus, the model seems to fit this aspect of the data.
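To attach numbers to the comparison, one can also compute a posterior predictive p-value for each order statistic (a sketch, re-running the simulation above):

```r
# Per-rank posterior predictive p-values: fraction of replicated order
# statistics exceeding the expected order statistics from the exercise
set.seed(1)
sigma <- c(15, 10, 16, 11, 9, 11, 10, 18)
theta <- rnorm(1000, 7.7, 4.1)
yrep <- apply(sapply(theta, function(x) rnorm(8, x, sd = sigma)), 2, sort)
expected <- sort(c(26, 19, 14, 10, 6, 2, -3, -9))
pvals <- sapply(1:8, function(i) mean(yrep[i, ] > expected[i]))
round(pvals, 2)
```

p-values near 0 or 1 would flag a misfit for that rank.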
Based on theory, it seems unlikely that the intervention would be implemented equally well across all 8 schools and that the effect would be the same.
In exercise 2.13, the counts of airline fatalities in 1976-1985 were fitted to four different Poisson models.
(I actually fitted only the first two models, because the other two were the same models applied to different data; I will do the same here.)
For each of the models, set up a posterior predictive test quantities to check the following assumptions: (1) independent Poisson distributions, (2) no trend over time. Display the discrepancies graphically and give p-values.
To test (1), we can compare autocorrelations in the sample with those in posterior predictive draws. To test (2), we can compute the linear trend with year in posterior predictive draws and compare it to the sample trend.
year <- 1976:1985
accidents <- c(24, 25, 31, 31, 22, 21, 26, 20, 16, 22)
deaths <- c(734, 516, 754, 877, 814, 362, 764, 809, 223, 1066)
death_rate <- c(0.19, 0.12, 0.15, 0.16, 0.14, 0.06, 0.13, 0.13, 0.03, 0.15)
miles <- deaths / death_rate
theta_s1 <- rgamma(1e3, sum(accidents), length(accidents))
theta_s2 <- rgamma(1e3, sum(accidents),
length(accidents) * mean(miles) / 1e3)
postpred_draws1 <- matrix(rpois(10 * 1000, theta_s1), ncol = 1000)
postpred_draws2 <- matrix(rpois(10 * 1000, theta_s2 * mean(miles) / 1e3), ncol = 1000)
# Autocorrelation
ac_sample <- stats::acf(accidents, plot = FALSE)$acf
ac_draws1 <- apply(postpred_draws1, 2,
function(x) stats::acf(x, plot = FALSE)$acf)
ac_draws2 <- apply(postpred_draws2, 2,
function(x) stats::acf(x, plot = FALSE)$acf)
par(mfrow = c(1, 2))
plot(1:10, seq(-1, 1, length = 10), type = 'n', axes = FALSE,
xlab = 'Lag', ylab = 'Autocorrelation', main = 'Model 1')
for (i in 1:10) points(rep(i, 1000), ac_draws1[i, ], pch = 19,
col = colt('steelblue', 0.1))
points(1:10, ac_sample, pch = 19, col = 'firebrick')
axis(1, tick = FALSE)
axis(2, tick = FALSE, las = 1)
box(bty = 'L', col = 'grey60')
plot(1:10, seq(-1, 1, length = 10), type = 'n', axes = FALSE,
xlab = 'Lag', ylab = 'Autocorrelation', main = 'Model 2')
for (i in 1:10) points(rep(i, 1000), ac_draws2[i, ], pch = 19,
col = colt('steelblue', 0.1))
points(1:10, ac_sample, pch = 19, col = 'firebrick')
axis(1, tick = FALSE)
axis(2, tick = FALSE, las = 1)
box(bty = 'L', col = 'grey60')
ac_pvals1 <- numeric(10)
ac_pvals2 <- numeric(10)
for (i in 1:10) {
ac_pvals1[i] <- mean(ac_sample[i] > ac_draws1[i, ])
ac_pvals2[i] <- mean(ac_sample[i] > ac_draws2[i, ])
}
# Drop lag 0 autocorrelation
1 - ac_pvals1[-1]
## [1] 0.026 0.565 0.248 0.316 0.962 0.963 0.698 0.511 0.498
1 - ac_pvals2[-1]
## [1] 0.044 0.581 0.241 0.344 0.964 0.970 0.668 0.508 0.493
The lag-1 autocorrelation in our sample seemed a bit unusually high compared with the posterior predictive draws; the autocorrelations at the other lags, however, fell comfortably within the posterior predictive distribution.
slope_sample <- coef(lm(accidents ~ year))
slope_draws1 <- apply(postpred_draws1, 2, function(x) coef(lm(x ~ year)))
slope_draws2 <- apply(postpred_draws2, 2, function(x) coef(lm(x ~ year)))
par(mfrow = c(1, 2))
plot(year, accidents, type = 'n', axes = FALSE, xlab = 'Year',
ylab = '# accidents', main = 'Model 1')
for (i in 1:100) abline(a = slope_draws1[1, i], b = slope_draws1[2, i],
col = colt('steelblue', 0.2))
abline(slope_sample[1], slope_sample[2], col = 'firebrick')
axis(1, tick = FALSE)
axis(2, tick = FALSE, las = 1)
box(bty = 'L', col = 'grey60')
plot(year, accidents, type = 'n', axes = FALSE, xlab = 'Year',
ylab = '# accidents', main = 'Model 2')
for (i in 1:100) abline(a = slope_draws2[1, i], b = slope_draws2[2, i],
col = colt('steelblue', 0.2))
abline(slope_sample[1], slope_sample[2], col = 'firebrick')
axis(1, tick = FALSE)
axis(2, tick = FALSE, las = 1)
box(bty = 'L', col = 'grey60')
mean(slope_sample[2] > slope_draws1[2, ])
## [1] 0.051
mean(slope_sample[2] > slope_draws2[2, ])
## [1] 0.059
The slope observed in the data did not seem particularly unusual compared with the slopes obtained from the posterior predictive draws.
Suppose that the scalar variable \(\theta\) is approximately normally distributed in a posterior distribution that is summarized by \(n\) independent simulation draws. How large does \(n\) have to be so that the 2.5% and 97.5% quantiles of \(\theta\) are specified to an accuracy of \(0.1 \; \text{sd}(\theta \lvert y)\)?
Figure this out mathematically, without using simulation.
According to Wikipedia, the asymptotic sampling variance of the \(p\)-th sample quantile from \(S\) draws is \(\frac{p(1 - p)}{S f_X(x_{p})^2}\), where \(f_X(x_p)\) is the density at the true quantile (here, \(f_X(x_p) = p(\theta_p \lvert y)\)). Thus we have:
\[\sqrt{\frac{p(1 - p)}{S \cdot p(\theta_{p} \lvert y)^2}} = 0.1 \cdot \text{sd}(\theta \lvert y)\] \[\implies \frac{p(1-p)}{S \cdot p(\theta_p \lvert y)^2} = 0.01 \cdot Var(\theta \lvert y)\] \[\implies S = \frac{p(1 - p)}{0.01 \cdot Var(\theta \lvert y) \cdot p(\theta_p \lvert y)^2}\]
We can now plug in the specific values of \(p = (0.025, 0.975)\) (both yield the same numerator), but we also need to know the posterior variance and the posterior density at the (true) quantile. For example, if the posterior mean is \(E(\theta \lvert y) = 0\) and the posterior variance is \(\text{Var}(\theta \lvert y) = 1\), then \(\theta_{0.025} = -1.96\) and \(p(\theta_p \lvert y)^2 = 0.0584^2 \approx 0.00342\). Then:
\[S = \frac{0.025 \cdot(1 - 0.025)}{0.01 \cdot 1 \cdot 0.00342} \approx 714\]
(0.025 * (1 - 0.025)) / (0.01 * dnorm(qnorm(0.025))^2)
## [1] 713.5902
(The answer is the same for the 97.5% quantile because of the symmetry of the normal distribution.)
Check your answer using simulation and show your results.
theta_draws <- matrix(rnorm(714 * 1000, 0, 1), ncol = 1000)
theta_quants <- apply(theta_draws, 2, quantile, c(0.025, 0.975))
apply(theta_quants, 1, sd)
## 2.5% 97.5%
## 0.10002788 0.09522285
The standard deviation of the quantile estimates was close to 0.1, which is \(0.1 \cdot 1 = 0.1 \cdot \text{sd}(\theta \lvert y)\), so the method seems to work. The only issue is that often we won't know the posterior mean and variance (estimating them is the reason we're doing Bayesian statistics in the first place), nor the posterior density at the quantile, so we might have to rely on approximations (e.g. plugging in estimates for those quantities).
Suppose you are interested in inference for the parameters in a multivariate posterior distribution \(p(\theta \lvert y)\). You draw 100 independent values of \(\theta\) from the posterior distribution of \(\theta\) and find that the posterior density for \(\theta_1\) is approximately normal with a mean of about 8 and a standard deviation of about 4.
Using the average of the 100 draws of \(\theta_1\) to estimate the posterior mean \(E(\theta_1 \lvert y)\), what is the approximate standard deviation due to simulation variability?
The classical result for the standard error of the sample mean is: \(\sigma_{\bar x} = \sqrt{\frac{\sigma^2}{N}}\)
Therefore, the naive approximation of the Monte Carlo error (the standard deviation due to simulation variability) will be:
\[\sqrt{\frac{4^2}{100}} = \frac{4}{10}\]
### (b)
About how many simulation draws would you need to reduce the standard deviation of the posterior mean to 0.1 (thus justifying the presentation of results to one decimal place)?
\[\sqrt{\frac{4^2}{S}} = 0.1\] \[\implies \frac{4^2}{S} = 0.01\] \[\implies \frac{1}{S} = \frac{0.01}{4^2}\] \[\implies S = \frac{4^2}{0.01} = 1600\]
theta_draws <- matrix(rnorm(1600 * 1000, 8, 4), ncol = 1000)
sd(colMeans(theta_draws))
## [1] 0.1015697
A more usual summary of the posterior distribution of \(\theta_1\) is a 95% central posterior interval. Based on the data from 100 draws, what are the approximate standard deviations of the estimated 2.5% and 97.5% quantiles of the posterior distribution?
Using the results we got in the previous question, we found the expression of the standard deviation of the quantiles to be: \[\sqrt{\frac{p(1 - p)}{S \cdot p(\theta_{p} \lvert y)^2}}\]
Now, plugging in the appropriate values:
\[\sqrt{\frac{0.025 \cdot (1 - 0.025)}{100 \cdot 0.0146^2}} = 1.0685\]
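The formula can also be checked by brute force: repeatedly draw 100 values, take the empirical 2.5% quantile each time, and look at its spread. This is a rough check, since the asymptotic formula need not match the finite-sample spread exactly at \(S = 100\):

```r
# Spread of the empirical 2.5% quantile across 2,000 sets of 100 draws
set.seed(2)
q_sims <- replicate(2000, quantile(rnorm(100, 8, 4), 0.025))
sd(q_sims)  # roughly 1, in line with the formula
```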
sqrt((0.025 * (1 - 0.025)) / (100 * dnorm(qnorm(0.025, 8, 4), 8, 4)^2))
## [1] 1.068524
I.e. the Monte Carlo standard error is about 1, roughly a quarter of the posterior standard deviation.
About how many simulation draws would you need to reduce the simulation standard deviations of the 2.5% and 97.5% quantiles to 0.1?
Again, using the same technique as in the previous exercise, we have: \(S = \frac{p(1 - p)}{0.1^2 \cdot p(\theta_p \lvert y)^2}\)
(0.025 * (1 - 0.025)) / (0.01 * dnorm(qnorm(0.025, 8, 4), 8, 4)^2)
## [1] 11417.44
I.e. we’d need around 11,417 draws.
In the eight-schools example in Section 5.5, we simulated 200 posterior draws. What are the approximate simulation standard deviations of the 2.5% and 97.5% quantiles for school A in Table 5.3?
From exercise 5.3, I got the marginal posterior mean for the effect in school A to be 11.7 with a standard deviation of 10.24. Using 200 samples, that would make the Monte Carlo error for the quantiles:
sqrt((0.025 * (1 - 0.025)) / (200 * dnorm(qnorm(0.025, 11.7, 10.24), 11.7, 10.24)^2))
## [1] 1.934236
I.e. almost 2 points.
Why was it not necessary, in practice, to simulate more than 200 draws for the SAT coaching example?
I suppose that in the example, we weren’t really interested in estimating the extreme quantiles of the effects in individual schools, but rather, the posterior means and probabilities that one school was better than the others, which should be more stable.
Suppose \(y_1 \sim \text{Binomial}(n_1, p_1)\) is the number of patients successfully treated under an experimental new drug, and \(y_2 \sim \text{Binomial}(n_2, p_2)\) is the number of patients treated under the standard treatment. Assume that \(y_1\) and \(y_2\) are independent and assume independent Beta prior densities for the two probabilities of success. Let \(n_1 = 10, y_1 = 6\), and \(n_2 = 20, y_2 = 10\). Repeat the following for several different beta prior specifications:
I will repeat the computations for the following choices of prior: \(\text{Beta}(1, 1)\) (uniform), \(\text{Beta}(1/2, 1/2)\) (the Jeffreys prior for the binomial), and \(\text{Beta}(10, 10)\) (a somewhat informative prior).
Use simulation to find a 95% posterior interval for \(p_1 - p_2\) and the posterior probability that \(p_1 > p_2\).
priors <- matrix(c(1, 1, 1/2, 1/2, 10, 10), nrow = 2)
n1 <- 10; y1 <- 6
n2 <- 20; y2 <- 10
post_pars1 <- priors + c(y1, n1 - y1)
post_pars2 <- priors + c(y2, n2 - y2)
post1 <- apply(post_pars1, 2, function(x) rbeta(1000, x[1], x[2]))
post2 <- apply(post_pars2, 2, function(x) rbeta(1000, x[1], x[2]))
post_diff <- post1 - post2
post_gt <- post1 > post2
apply(post_diff, 2, quantile, c(0.025, 0.975))
## [,1] [,2] [,3]
## 2.5% -0.2333925 -0.2823738 -0.21386
## 97.5% 0.4190517 0.4347905 0.25919
colMeans(post_gt)
## [1] 0.702 0.662 0.598
Numerically integrate to estimate the posterior probability that \(p_1 > p_2\).
I first compute the joint density of \(p_1\) and \(p_2\) across a finite grid of values, normalize so that the posterior sums to one, throw out all values where \(p_1 < p_2\), and finally sum up the probability in this region:
p <- seq(0, 1, length = 100)
post_joint <- outer(p, p, function(x, y) dbeta(x, 7, 5) * dbeta(y, 11, 11))
post_joint <- post_joint / sum(post_joint)
post_diff2 <- post_joint[row(post_joint) > col(post_joint)]
sum(post_diff2)
## [1] 0.6762958
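The grid result can be cross-checked with one-dimensional numerical integration, conditioning on \(p_2\) and using the same uniform-prior posteriors \(p_1 \lvert y \sim \text{Beta}(7, 5)\) and \(p_2 \lvert y \sim \text{Beta}(11, 11)\):

```r
# P(p1 > p2) = integral over p2 of  dbeta(p2; 11, 11) * P(p1 > p2 | p2)
integrate(function(q) dbeta(q, 11, 11) * (1 - pbeta(q, 7, 5)), 0, 1)$value
```

The coarse 100-point grid slightly understates the probability, since it discards mass near the diagonal \(p_1 = p_2\).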
Prove that rejection sampling gives draws from \(p(\theta \lvert y)\)
Let \(\theta^*\) be a proposed draw from \(g(\theta)\) and \(u\) a draw from a \(\text{Uniform}(0, 1)\) distribution. We accept \(\theta^*\) if \(u \leq \frac{p(\theta^* \lvert y)}{M \cdot g(\theta^*)}\), where \(M\) is a constant ensuring that \(M \cdot g(\theta)\) blankets \(p(\theta \lvert y)\), i.e. \(p(\theta \lvert y) \leq M \cdot g(\theta)\) for all \(\theta\).
Now:
\[P(\theta^* \leq \theta \lvert \theta^* \text{ is accepted}) = \frac{P(\theta^* \leq \theta, \theta^* \text{ is accepted})}{P(\theta^* \text{ is accepted})}\] \[= \frac{P(\theta^* \leq \theta, u \leq \frac{p(\theta^* \lvert y)}{M \cdot g(\theta^*)})}{P(u \leq \frac{p(\theta^* \lvert y)}{M \cdot g(\theta^*)})}\]
Now the numerator is:
\[P \Bigg( \theta^* \leq \theta, u \leq \frac{p(\theta^* \lvert y)}{M \cdot g(\theta^*)} \Bigg) = \int_{-\infty}^\theta P \bigg(u \leq \frac{p(\theta^* \lvert y)}{M \cdot g(\theta^*)} \bigg) g(\theta^*) d\theta^*\] \[= \int_{-\infty}^\theta \frac{p(\theta^* \lvert y)}{M \cdot g(\theta^*)} g(\theta^*) d\theta^*\] \[= \frac{1}{M} \int_{-\infty}^\theta p(\theta^* \lvert y) d\theta^* = \frac{1}{M} \cdot F(\theta)\]
Likewise, for the denominator:
\[P \Bigg(u \leq \frac{p(\theta^* \lvert y)}{M \cdot g(\theta^*)} \Bigg) = \int_{-\infty}^\infty P \Bigg(u \leq \frac{p(\theta^* \lvert y)}{M \cdot g(\theta^*)} \Bigg) g(\theta^*) d \theta^*\] \[= \int_{-\infty}^\infty \frac{p(\theta^* \lvert y)}{M \cdot g(\theta^*)} g(\theta^*) d \theta^*\] \[= \frac{1}{M}\]
Therefore:
\[P(\theta^* \leq \theta \lvert \theta^* \text{ is accepted}) = \frac{\frac{1}{M} \cdot F(\theta)}{\frac{1}{M}} = F(\theta)\]
Why is the boundedness condition necessary for rejection sampling?
We need \(g(\theta) > 0\) wherever \(p(\theta \lvert y) > 0\) (i.e. the proposal must be able to reach every region where the posterior has mass), and we need the ratio \(\frac{p(\theta \lvert y)}{g(\theta)}\) to be bounded so that a finite \(M\) with \(\frac{p(\theta \lvert y)}{M \cdot g(\theta)} \leq 1\) exists. Otherwise the acceptance probability is capped at 1 in regions where the ratio exceeds \(M\), and those regions end up undersampled relative to the target.
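A minimal self-contained illustration of the algorithm (the target and proposal are arbitrary choices here): sample from \(\text{Beta}(2, 5)\) using a \(\text{Uniform}(0, 1)\) proposal, with \(M\) set to the maximum density ratio:

```r
set.seed(3)
# With g = 1, the condition p(x) <= M * g(x) means M = max of the target density
M <- optimize(function(x) dbeta(x, 2, 5), c(0, 1), maximum = TRUE)$objective
draws <- replicate(2000, {
  repeat {
    x <- runif(1)                        # propose from g
    if (runif(1) <= dbeta(x, 2, 5) / M)  # accept with probability p(x) / (M g(x))
      break
  }
  x
})
mean(draws)  # close to the Beta(2, 5) mean, 2/7
```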
Consider the model \(y_j \sim \text{Binomial}(n_j, \theta_j)\) where \(\theta_j = \text{logit}^{-1}(\alpha + \beta x_j)\) for \(j = 1, 2, \ldots, J\), with independent prior distributions \(\alpha \sim t_4 (0, 2^2)\) and \(\beta \sim t_4(0, 1)\). Suppose \(J = 10\), with the \(x_j\) values drawn at random from a \(\text{Uniform}(0, 1)\) distribution and \(n_j \sim \text{Poisson}^+(5)\) (the Poisson distribution restricted to positive values).
Sample a dataset at random from the model
set.seed(12345)
J <- 10
n <- rpois(J, 5) # note: rpois is not truncated at zero; a 0 draw would need resampling to match Poisson+
x <- runif(J)
alpha <- 2 * rt(1, 4)
beta <- rt(1, 4)
eta <- alpha + beta * x
theta <- 1 / (1 + exp(-eta))
y <- rbinom(J, n, theta)
Use rejection sampling to get 1,000 independent posterior draws from \((\alpha, \beta)\).
The Student-\(t\) distribution has the following pdf:
\[f_X(x) = \frac{\Gamma(\frac{\nu + 1}{2})}{\sqrt{\nu \pi} \Gamma (\frac{\nu}{2})} (1 + \frac{x^2}{\nu})^{-\frac{\nu + 1}{2}}\]
Thus:
\[p(\beta) = \frac{\Gamma(\frac{4 + 1}{2})}{\sqrt{4 \pi} \cdot \Gamma (\frac{4}{2})} (1 + \frac{\beta^2}{4})^{-\frac{4 + 1}{2}}\] \[= \frac{\frac{3}{4} \sqrt{\pi}}{2 \sqrt{\pi} \cdot 1} (1 + \frac{\beta^2}{4})^{-\frac{5}{2}} = \frac{3}{8}(1 + \frac{\beta^2}{4})^{-\frac{5}{2}}\] Now to find \(p(\alpha)\), we can just apply change of variable method:
\[\alpha = \alpha(\beta) = 2 \beta\] \[\beta = \beta(\alpha) = \frac{1}{2} \alpha\] \[ \Bigg\lvert \frac{\partial \beta}{\partial \alpha} \Bigg\lvert = \Bigg\lvert \frac{1}{2} \Bigg\lvert\] \[p(\alpha) = \frac{3}{8}(1 + \frac{(\frac{\alpha}{2})^2}{4})^{-\frac{5}{2}} \cdot \frac{1}{2} = \frac{3}{16}(1 + \frac{\alpha^2}{16})^{-\frac{5}{2}}\]
Now, the likelihood \(p(y_j \lvert n_j, \theta_j)\) is just a simple binomial likelihood:
\[p(y_j \lvert n_j, \theta_j) \propto \theta_j^{y_j} (1 - \theta_j)^{n_j - y_j}\]
Expressed in terms of \(\alpha\) and \(\beta\):
\[p(y_j \lvert \alpha, \beta, x_j) \propto \bigg( \frac{1}{1+ e^{-(\alpha + \beta x_j)}} \bigg)^{y_j} \bigg( \frac{e^{-(\alpha + \beta x_j)}}{1+ e^{-(\alpha + \beta x_j)}} \bigg)^{n_j - y_j}\]
So then the posterior is:
\[p(\alpha, \beta \lvert y, x) \propto (1 + \frac{\beta^2}{4})^{-\frac{5}{2}}(1 + \frac{\alpha^2}{16})^{-\frac{5}{2}} \prod_{j=1}^J \bigg( \frac{1}{1+ e^{-(\alpha + \beta x_j)}} \bigg)^{y_j} \bigg( \frac{e^{-(\alpha + \beta x_j)}}{1+ e^{-(\alpha + \beta x_j)}} \bigg)^{n_j - y_j}\] As the envelope distribution, I used a bivariate normal with independent components. Since the parameters in the posterior were correlated (the posterior had a diagonal shape), I had to experiment with the values of the \(\mu\)'s and \(\sigma\)'s to keep the ratio of the posterior density to the blanketing density from becoming too large anywhere the posterior has appreciable mass.
d <- n - y
lpost <- function(theta) {
a <- theta[1]; b <- theta[2]
-5/2 * log(1 + b^2/ 4) - 5/2 * log(1 + a^2/16) +
sum(y * (a + b * x) - n * log(1 + exp(a + b * x)))
}
lprop <- function(theta) {
a <- theta[1]; b <- theta[2]
dnorm(a, mus[1], sds[1], log = TRUE) +
dnorm(b, mus[2], sds[2], log = TRUE)
}
# Find maximum of posterior and proposal density (used to normalize later)
post_mode <- optim(c(-1, 0), lpost, control = list(fnscale = -1))
max_post <- post_mode$value
mus <- post_mode$par
sds <- c(0.5, 1)
max_prop <- lprop(mus)
a <- seq(-2, 1, length = 200)
b <- seq(-2.5, 3, length = 200)
post <- matrix(0, ncol = length(a), nrow = length(b))
prop_dist <- post
for (i in seq_along(a)) {
for (j in seq_along(b)) {
post[i, j] <- lpost(c(a[i], b[j]))
prop_dist[i, j] <- dnorm(a[i], mus[1], sds[1], log = TRUE) +
dnorm(b[j], mus[2], sds[2], log = TRUE)
}
}
contour(a, b, exp(post), col = 'firebrick',
drawlabels = FALSE, axes = FALSE,
xlab = expression(alpha), ylab = expression(beta))
contour(a, b, exp(prop_dist), add = TRUE, col = 'seagreen',
drawlabels = FALSE)
axis(1, tick = FALSE)
axis(2, tick = FALSE)
box(bty = 'L', col = 'grey60')
post_ratio <- function(theta) {
# log ratio of the max-normalized posterior to the max-normalized proposal
(lpost(theta) - max_post) - (lprop(theta) - max_prop)
}
M <- optim(c(-1, 0), post_ratio,
lower = c(-2, -2.5), upper = c(1, 3),
method = 'L-BFGS-B',
control = list(fnscale = -1))$value
n_samples <- 1000
n_sampled <- 0
ab_samples <- matrix(0, ncol = 2, nrow = n_samples)
while (n_sampled < n_samples) {
prop <- c(rnorm(1, mus[1], sds[1]), rnorm(1, mus[2], sds[2]))
prop_dens <- lprop(prop) - max_prop
post_dens <- lpost(prop) - max_post
if (log(runif(1)) < post_dens - prop_dens - M) {
n_sampled <- n_sampled + 1
ab_samples[n_sampled, ] <- prop
}
}
plot(ab_samples[, 1], ab_samples[, 2],
col = colt('steelblue', 0.25), pch = 19, axes = FALSE,
xlab = expression(alpha), ylab = expression(beta))
contour(a, b, exp(post), col = 'firebrick',
add = TRUE, drawlabels = FALSE)
axis(1, tick = FALSE)
axis(2, tick = FALSE)
box(bty = 'L', col = 'grey60')
Approximate the posterior density for \((\alpha, \beta)\) by a normal centered at the posterior mode with covariance matrix fit to the curvature at the mode.
Now, to make things a bit more interesting, I will use the Newton-Raphson algorithm to get the posterior mode. First I need all of the first & second derivatives of the posterior density:
\[\log p(\alpha, \beta \lvert y_j, x_j) = -\frac{5}{2} \cdot \log(1 + \frac{\alpha^2}{16}) -\frac{5}{2} \cdot \log(1 + \frac{\beta^2}{4}) + \sum_{j=1}^J y_j (\alpha + \beta x_j) - n_j \cdot \log(1 + e^{\alpha + \beta x_j})\]
\[\frac{\partial \log p(\alpha, \beta \lvert y_j, x_j)}{\partial \alpha} = -\frac{5}{2} \cdot \frac{1}{1 + \frac{\alpha^2}{16}} \cdot \frac{\alpha}{8} + \sum_{j=1}^J y_j - n_j \cdot \frac{e^{\alpha + \beta x_j}}{1 + e^{\alpha + \beta x_j}} = -\frac{5\alpha}{16 + \alpha^2} + \sum_{j=1}^J y_j - n_j \cdot \frac{e^{\alpha + \beta x_j}}{1 + e^{\alpha + \beta x_j}}\] \[\frac{\partial \log p(\alpha, \beta \lvert y_j, x_j)}{\partial \beta} = -\frac{5}{2} \cdot \frac{1}{1 + \frac{\beta^2}{4}} \cdot \frac{\beta}{2} + \sum_{j=1}^J y_jx_j - n_jx_j \cdot \frac{e^{\alpha + \beta x_j}}{1 + e^{\alpha + \beta x_j}} = -\frac{5 \beta}{4 + \beta^2} + \sum_{j=1}^J y_jx_j - n_jx_j \cdot \frac{e^{\alpha + \beta x_j}}{1 + e^{\alpha + \beta x_j}}\] \[\frac{\partial^2 \log p(\alpha, \beta \lvert y_j, x_j)}{\partial \alpha^2} = -\frac{5(16 + \alpha^2) - 2\alpha \cdot 5 \alpha}{(16 + \alpha^2)^2} - \sum_{j=1}^J n_j \cdot \frac{e^{\alpha + \beta x_j}}{1 + e^{\alpha + \beta x_j}} \cdot \frac{1}{1 + e^{\alpha + \beta x_j}} = -\frac{80 - 5 \alpha^2}{(16 + \alpha^2)^2} - \sum_{j=1}^J n_j \cdot \frac{e^{\alpha + \beta x_j}}{1 + e^{\alpha + \beta x_j}} \cdot \frac{1}{1 + e^{\alpha + \beta x_j}}\] \[\frac{\partial^2 \log p(\alpha, \beta \lvert y_j, x_j)}{\partial \beta^2} = -\frac{5(4 + \beta^2) - 2\beta \cdot 5 \beta}{(4 + \beta^2)^2} - \sum_{j=1}^J n_jx_j^2 \cdot \frac{e^{\alpha + \beta x_j}}{1 + e^{\alpha + \beta x_j}} \cdot \frac{1}{1 + e^{\alpha + \beta x_j}} = -\frac{20 - 5 \beta^2}{(4 + \beta^2)^2} - \sum_{j=1}^J n_jx_j^2 \cdot \frac{e^{\alpha + \beta x_j}}{1 + e^{\alpha + \beta x_j}} \cdot \frac{1}{1 + e^{\alpha + \beta x_j}}\] \[\frac{\partial^2 \log p(\alpha, \beta \lvert y_j, x_j)}{\partial \alpha \partial \beta} = -\sum_{j=1}^J n_j x_j \cdot \frac{e^{\alpha + \beta x_j}}{1 + e^{\alpha + \beta x_j}} \frac{1}{1 + e^{\alpha + \beta x_j}}\]
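These expressions are easy to get wrong, so it's worth spot-checking one of them against a finite difference. A self-contained sketch (the data `yy`, `nn`, `xx` and the evaluation point are arbitrary, chosen only so the snippet runs on its own):

```r
yy <- c(1, 3, 2); nn <- c(4, 5, 6); xx <- c(0.1, 0.5, 0.9)
# Log posterior, mirroring lpost() above
lp <- function(a, b) {
  -5/2 * log(1 + b^2 / 4) - 5/2 * log(1 + a^2 / 16) +
    sum(yy * (a + b * xx) - nn * log(1 + exp(a + b * xx)))
}
# Analytic derivative with respect to alpha
dlp_da <- function(a, b) {
  -(5 * a) / (16 + a^2) + sum(yy - nn * exp(a + b * xx) / (1 + exp(a + b * xx)))
}
h <- 1e-6
fd <- (lp(0.3 + h, -0.2) - lp(0.3 - h, -0.2)) / (2 * h)  # central difference
abs(fd - dlp_da(0.3, -0.2))  # should be tiny
```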
ilogit <- function(a, b, x) exp(a + b * x) / (1 + exp(a + b * x))
da <- function(a, b) - (5 * a) / (16 + a^2) + sum(y - n * ilogit(a, b, x))
db <- function(a, b) -(5 * b) / (4 + b^2) + sum(y * x - n * x * ilogit(a, b, x))
da2 <- function(a, b) -(80 - 5 * a^2) / (16 + a^2)^2 - sum(n * ilogit(a, b, x) * (1 - ilogit(a, b, x)))
db2 <- function(a, b) -(20 - 5 * b^2) / (4 + b^2)^2 - sum(n * x^2 * ilogit(a, b, x) * (1 - ilogit(a, b, x)))
dab <- function(a, b) -sum(n * x * ilogit(a, b, x) * (1 - ilogit(a, b, x)))
theta_old <- c(Inf, Inf)
theta_new <- c(0, 0)
eps <- 1e-5
n_iter <- 0
while(any(abs(theta_old - theta_new) > eps)) {
u <- c(da(theta_new[1], theta_new[2]), db(theta_new[1], theta_new[2]))
H <- matrix(c(
da2(theta_new[1], theta_new[2]),
dab(theta_new[1], theta_new[2]),
dab(theta_new[1], theta_new[2]),
db2(theta_new[1], theta_new[2])), ncol = 2)
theta_old <- theta_new
theta_new <- theta_old - solve(H) %*% u
n_iter <- n_iter + 1
}
# Hessian of the log posterior at the mode; the covariance is -solve(covmat)
covmat <- matrix(c(
da2(theta_new[1], theta_new[2]),
dab(theta_new[1], theta_new[2]),
dab(theta_new[1], theta_new[2]),
db2(theta_new[1], theta_new[2])),
ncol = 2)
n_iter
## [1] 4
ab_samples2 <- MASS::mvrnorm(1e3, theta_new, solve(-covmat))
plot(ab_samples2[, 1], ab_samples2[, 2], pch = 19, col = colt('steelblue', 0.25),
axes = FALSE, xlab = expression(alpha), ylab = expression(beta))
contour(a, b, exp(post), col = 'firebrick',
add = TRUE, drawlabels = FALSE)
axis(1, tick = FALSE)
axis(2, tick = FALSE)
box(bty = 'L', col = 'grey60')
Take 1,000 draws from two-dimensional \(t_4\) distribution with that center and scale matrix and use importance sampling to estimate \(E(\alpha \lvert y)\) and \(E(\beta \lvert y)\).
ab_samples3 <- t(t(mvtnorm::rmvt(1e3, solve(-covmat), 4)) +
as.numeric(theta_new))
weights <- apply(ab_samples3, 1, function(x) lpost(x)) -
apply(t(t(ab_samples3) - as.numeric(theta_new)), 1,
function(x) mvtnorm::dmvt(x, sigma = solve(-covmat), df = 4))
cbind(
colMeans(ab_samples),
colMeans(ab_samples2),
colSums(t(t(ab_samples3) * exp(weights))) / sum(exp(weights))
)
## [,1] [,2] [,3]
## [1,] -0.6749608 -0.6495221 -0.6400854
## [2,] 0.1800793 0.1432672 0.1551622
Compute an estimate of effective sample size for importance sampling.
\[S_{eff} = \frac{1}{\sum_{s=1}^S (\tilde w (\theta^s))^2}\]
# The weights are on the log scale, so exponentiate before normalizing
weights_normalized <- exp(weights - max(weights))
weights_normalized <- weights_normalized / sum(weights_normalized)
1 / sum(weights_normalized^2)
Consider a multivariate posterior \(p(\theta \lvert y)\) which we wish to approximate and then calculate the moments of, using importance sampling from an unnormalized density \(g(\theta)\). Suppose the posterior distribution is normal, and the approximation is \(t_3\), with mode and curvature matched to the posterior density.
Draw a sample of size \(S = 100\) from the approximate density and compute the importance ratios. Plot a histogram of the log importance ratios.
mu <- 5
sigma <- 10
dens_norm <- function(x, dens_fun) {
dens <- dens_fun(x)
dens - max(dens)
}
samples <- sigma * rt(100, 3) + mu
dens1 <- dens_norm(samples, function(x) dnorm((x - mu) / sigma, 0, 1, log = TRUE))
dens2 <- dens_norm(samples, function(x) dt((x - mu) / sigma, 3, log = TRUE))
weights <- exp(dens1) / exp(dens2)
hist(log(weights), axes = FALSE, main = NULL,
col = 'steelblue', border = 'white',
xlab = expression(log(tilde(w))), ylab = 'Count')
axis(1, tick = FALSE)
axis(2, tick = FALSE, las = 1)
box(bty = 'L', col = 'grey60')
Estimate \(E(\theta \lvert y)\) and \(Var(\theta \lvert y)\) using importance sampling. Compare to the true values.
# E(X)
(mu_hat <- sum(samples * weights) / sum(weights))
## [1] 4.041733
# Var(X)
(var_hat <- (sum((samples - mu_hat)^2 * weights) / sum(weights)))
## [1] 92.8332
(sigma_hat <- sqrt(var_hat))
## [1] 9.634999
The true values are \(E(\theta \lvert y) = 5\) and \(\text{sd}(\theta \lvert y) = 10\), so with only 100 draws the estimates are in the right ballpark but still noticeably noisy.
Repeat (a) and (b) for 10,000 samples.
samples2 <- sigma * rt(1e4, 3) + mu
dens1 <- dens_norm(samples2, function(x) dnorm((x - mu) / sigma, 0, 1, log = TRUE))
dens2 <- dens_norm(samples2, function(x) dt((x - mu) / sigma, 3, log = TRUE))
weights2 <- exp(dens1) / exp(dens2)
hist(log(weights2), axes = FALSE, main = NULL,
col = 'steelblue', border = 'white',
xlab = expression(log(tilde(w))), ylab = 'Count')
axis(1, tick = FALSE)
axis(2, tick = FALSE, las = 1)
box(bty = 'L', col = 'grey60')
# E(X)
(mu_hat2 <- sum(samples2 * weights2) / sum(weights2))
## [1] 4.918767
# Var(X)
(var_hat2 <- (sum((samples2 - mu_hat2)^2 * weights2) / sum(weights2)))
## [1] 101.9323
(sigma_hat2 <- sqrt(var_hat2))
## [1] 10.09615
Using the sample obtained in (c), compute an estimate of effective sample size.
1 / sum((weights2 / sum(weights2))^2)
## [1] 9191.578
Repeat the previous exercise, but with a \(t_3\) posterior and a normal approximation. Explain why the estimates of \(Var(\theta \lvert y)\) are systematically too low.
samples <- rnorm(100, mu, sigma)
dens1 <- dens_norm(samples, function(x) dt((x - mu) / sigma, 3, log = TRUE))
dens2 <- dens_norm(samples, function(x) dnorm((x - mu) / sigma, 0, 1, log = TRUE))
# The 1/sigma Jacobians of the two densities cancel in the ratio
weights <- exp(dens1) / exp(dens2)
hist(log(weights), axes = FALSE, main = NULL,
col = 'steelblue', border = 'white',
xlab = expression(log(tilde(w))), ylab = 'Count')
axis(1, tick = FALSE)
axis(2, tick = FALSE, las = 1)
box(bty = 'L', col = 'grey60')
# E(X)
(mu_hat <- sum(samples * weights) / sum(weights))
## [1] 8.611275
# Var(X)
(var_hat <- (sum((samples - mu_hat)^2 * weights) / sum(weights)))
## [1] 211.6488
(sigma_hat <- sqrt(var_hat))
## [1] 14.54815
samples2 <- rnorm(1e4, mu, sigma)
dens1 <- dens_norm(samples2, function(x) dt((x - mu) / sigma, 3, log = TRUE))
dens2 <- dens_norm(samples2, function(x) dnorm((x - mu) / sigma, 0, 1, log = TRUE))
weights2 <- exp(dens1) / exp(dens2)
hist(log(weights2), axes = FALSE, main = NULL,
col = 'steelblue', border = 'white',
xlab = expression(log(tilde(w))), ylab = 'Count')
axis(1, tick = FALSE)
axis(2, tick = FALSE, las = 1)
box(bty = 'L', col = 'grey60')
# E(X)
(mu_hat2 <- sum(samples2 * weights2) / sum(weights2))
## [1] 4.663964
# Var(X)
(var_hat2 <- (sum((samples2 - mu_hat2)^2 * weights2) / sum(weights2)))
## [1] 159.2284
(sigma_hat2 <- sqrt(var_hat2))
## [1] 12.61858
Both estimates fall well below the true variance \(10^2 \cdot \frac{3}{3 - 2} = 300\). The normal approximation has much lighter tails than the \(t_3\) posterior, so the extreme draws that contribute most to the variance are almost never proposed, and the importance weights cannot compensate for regions that are essentially never sampled; the variance estimates are therefore systematically too low.
Consider the bioassay example introduced in Section 3.7. Use importance resampling to approximate draws from the posterior distribution of the parameter \((\alpha, \beta)\) using the normal approximation of Section 4.1 as the starting distribution. Sample \(S = 10,000\) samples from the approximate distribution, and resample without replacement \(k = 1,000\) samples. Compare simulations of \((\alpha, \beta)\) to Figure 3.3b and discuss any discrepancies.
# Code copied from Chapter 4
y <- c(0, 1, 3, 5)
x <- c(-0.86, -0.3, -0.05, 0.73)
llfun <- function(theta) {
a <- theta[1]; b <- theta[2]
sum(y * (a + b * x) - 5 * log(1 + exp(a + b * x)))
}
mode <- optim(c(1, 1), llfun, control = list(fnscale = -1))$par
a <- mode[1]
b <- mode[2]
# Second derivatives evaluated at the mode
plpa2 <- -sum( (5 * exp(a + b * x)) / (1 + exp(a + b * x))^2)
plpab <- -sum( (5 * x * exp(a + b * x)) / (1 + exp(a + b * x))^2)
plpb2 <- -sum( (5 * x^2 * exp(a + b * x)) / (1 + exp(a + b * x))^2)
I <- matrix(c(plpa2, plpab, plpab, plpb2), ncol = 2)
var_theta <- -solve(I)
draws_approx <- mvtnorm::rmvnorm(1e4, mode, var_theta)
post_dens <- apply(draws_approx, 1, llfun)
post_dens <- post_dens - max(post_dens)
weights <- exp(post_dens) / mvtnorm::dmvnorm(draws_approx, mode, var_theta)
prob <- weights / sum(weights)
samples <- draws_approx[sample(1:1e4, 1e3, replace = FALSE, prob = prob), ]
plot(samples[, 1], samples[, 2], pch = 19, col = colt('steelblue', 0.1),
axes = FALSE, xlab = expression(alpha), ylab = expression(beta),
xlim = c(-4, 10), ylim = c(-10, 40))
axis(1, tick = FALSE)
axis(2, tick = FALSE, las = 1)
box(bty = 'L', col = 'grey60')
The posterior looks very similar to that in Figure 3.3.
Comment on the distribution of simulated importance ratios
hist(log(weights), axes = FALSE, main = NULL,
col = 'steelblue', border = 'white',
xlab = expression(log(tilde(w))), ylab = 'Count')
axis(1, tick = FALSE)
axis(2, tick = FALSE, las = 1)
box(bty = 'L', col = 'grey60')
The distribution of weights looked well-behaved, with a few very small weights but no dominant large ones.
Repeat part (a) using importance sampling with replacement. Discuss how results differ.
samples2 <- draws_approx[sample(1:1e4, 1e3, replace = TRUE, prob = prob), ]
plot(samples2[, 1], samples2[, 2], pch = 19, col = colt('steelblue', 0.1),
axes = FALSE, xlab = expression(alpha), ylab = expression(beta),
xlim = c(-4, 10), ylim = c(-10, 40))
axis(1, tick = FALSE)
axis(2, tick = FALSE, las = 1)
box(bty = 'L', col = 'grey60')
It looks like some of the draws with high posterior density were sampled multiple times.
Show that the stationary distribution for Metropolis-Hastings algorithm is the target distribution \(p(\theta \lvert y)\)
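A sketch of the standard detailed-balance argument: consider the Metropolis algorithm with a symmetric jumping distribution \(J(\theta^a \lvert \theta^b) = J(\theta^b \lvert \theta^a)\), suppose \(\theta^{t-1}\) is distributed according to \(p(\theta \lvert y)\), and take any two points labeled so that \(p(\theta^b \lvert y) \geq p(\theta^a \lvert y)\). Then:
\[p(\theta^{t-1} = \theta^a, \theta^t = \theta^b) = p(\theta^a \lvert y) J(\theta^b \lvert \theta^a)\]
since a jump from \(\theta^a\) to \(\theta^b\) is accepted with probability \(\min \big(1, \frac{p(\theta^b \lvert y)}{p(\theta^a \lvert y)} \big) = 1\), while:
\[p(\theta^{t-1} = \theta^b, \theta^t = \theta^a) = p(\theta^b \lvert y) J(\theta^a \lvert \theta^b) \cdot \frac{p(\theta^a \lvert y)}{p(\theta^b \lvert y)} = p(\theta^a \lvert y) J(\theta^a \lvert \theta^b)\]
By symmetry of \(J\), the two joint densities are equal, so the joint distribution of \((\theta^{t-1}, \theta^t)\) is symmetric and \(\theta^t\) has the same marginal distribution as \(\theta^{t-1}\), i.e. \(p(\theta \lvert y)\) is the stationary distribution. For the Metropolis-Hastings algorithm with an asymmetric \(J\), the acceptance ratio \(r = \frac{p(\theta^b \lvert y) / J(\theta^b \lvert \theta^a)}{p(\theta^a \lvert y) / J(\theta^a \lvert \theta^b)}\) produces exactly the same cancellation, so detailed balance again holds.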
Replicate the computations for the bioassay example of Section 3.7 using the Metropolis algorithm. Be sure to define your starting points & jumping rule. Compute log-densities. Run simulations long enough for approximate convergence.
y <- c(0, 1, 3, 5)
x <- c(-0.86, -0.3, -0.05, 0.73)
llfun <- function(theta) {
a <- theta[1]; b <- theta[2]
sum(y * (a + b * x) - 5 * log(1 + exp(a + b * x)))
}
n_samples <- 4e3
n_sampled <- 1
draws <- matrix(0, nrow = n_samples, ncol = 2)
curr <- c(1, 10)
draws[1, ] <- curr
while (n_sampled < n_samples) {
prop <- curr + c(rnorm(1, 0, 1), rnorm(1, 0, 5))
prop_dens <- llfun(prop)
curr_dens <- llfun(curr)
if (log(runif(1)) < prop_dens - curr_dens) {
draws[n_sampled + 1, ] <- prop
curr <- prop
} else {
draws[n_sampled + 1, ] <- curr
}
n_sampled <- n_sampled + 1
}
plot(draws[, 1], type = 'l', col = 'steelblue', ylab = expression(alpha))
plot(draws[, 2], type = 'l', col = 'steelblue', ylab = expression(beta))
plot(draws[, 1], draws[, 2], col = colt('steelblue', 0.1), pch = 19,
xlab = expression(alpha), ylab = expression(beta))
Table 11.4 contains quality control measurements from 6 machines in a factory. Quality control measurements are expensive and time consuming, so only 5 measurements were done for each machine. In addition to the existing machines, we are interested in the quality of another (seventh) machine. Implement a separate, a pooled, and a hierarchical Gaussian model with common variance, as described in Section 11.6. Run simulations long enough for approximate convergence. Using each of the three models - separate, pooled, and hierarchical - report: (i) the posterior distribution of the mean of the quality measurements of the sixth machine, (ii) the predictive distribution for another quality measurement of the sixth machine, and (iii) the posterior distribution of the mean of the quality measurements of the seventh machine.
I’ll be using the conjugate normal prior for the separate and pooled cases.
machine <- rep(1:6, each = 5)
measurement <- c(83, 92, 92, 46, 67, 117, 109, 114, 104, 87,
101, 93, 92, 86, 67, 105, 119, 116, 102, 116,
79, 97, 103, 79, 92, 57, 92, 104, 77, 100)
plot(jitter(machine, 0.5), measurement, pch = 19, col = sapply(machine, colt, 0.5),
xlab = 'Machine', ylab = 'Measurement')
rinvchisq <- function(n, df, scale = 1) (df * scale) / rchisq(n, df)
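A quick sanity check on this sampler: for \(\nu > 2\), the scaled inverse-\(\chi^2(\nu, s^2)\) distribution has mean \(\nu s^2 / (\nu - 2)\). The function is repeated here so the snippet runs on its own:

```r
set.seed(4)
rinvchisq <- function(n, df, scale = 1) (df * scale) / rchisq(n, df)
mean(rinvchisq(1e5, 10, 2))  # close to 10 * 2 / (10 - 2) = 2.5
```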
mu_0 <- 100
sigma_0 <- 5
nu_0 <- 5
k_0 <- 5
y <- measurement
sample_gibbs <- function(n_samples, y) {
n <- length(y)
mu_n <- (k_0 / (k_0 + n)) * mu_0 + (n / (k_0 + n)) * mean(y)
k_n <- k_0 + n
nu_n <- nu_0 + n
s2 <- var(y)
sigma_n <- sqrt(((nu_0 * sigma_0^2) + (n - 1) * s2 + (k_0 * n) /
(k_0 + n) * (mean(y) - mu_0)^2) / nu_n)
samples <- matrix(0, nrow = n_samples, ncol = 2)
samples[, 2] <- rinvchisq(n_samples, nu_n, sigma_n^2)
samples[, 1] <- rnorm(n_samples, mu_n, sqrt(samples[, 2] / k_n))
samples
}
separate_samples <- lapply(1:6, function(x) {
y <- measurement[machine == x]
sample_gibbs(1e3, y)
})
par(mfrow = c(2, 3), mar = c(3, 3, 1, 1))
for (i in 1:6) {
hist(separate_samples[[i]][, 1], main = NULL,
axes = FALSE, xlab = NULL, ylab = NULL,
col = 'steelblue', border = 'white')
axis(1, tick = FALSE)
axis(2, tick = FALSE, las = 1)
box(bty = 'L', col = 'grey60')
}
pooled_samples <- sample_gibbs(1e3, measurement)
par(mfrow = c(1, 1), mar = c(5, 4, 2, 2))
hist(pooled_samples[, 1], main = NULL,
axes = FALSE, xlab = NULL, ylab = NULL,
col = 'steelblue', border = 'white')
axis(1, tick = FALSE)
axis(2, tick = FALSE, las = 1)
box(bty = 'L', col = 'grey60')
hierarchical_samples <- matrix(0, nrow = 1000, ncol = 9)
hierarchical_samples[1, ] <- c(mu_0, 10, 10, rnorm(6, 100, 10))
mu <- mu_0
for (i in 1:999) {
theta <- hierarchical_samples[i, 4:9]
tau_hat <- sum((theta - mu)^2) / 5
mu_hat <- mean(theta)
sigma_hat <- mean((measurement - theta[machine])^2)
tau <- rinvchisq(1, 5, tau_hat)
mu <- rnorm(1, mu_hat, sqrt(tau / 6))
sigma <- rinvchisq(1, length(measurement), sigma_hat)
theta_hat <- sapply(1:6, function(x) {
(mu / tau + 5 * mean(measurement[machine == x]) / sigma) / (1 / tau + 5 / sigma)
})
var_theta <- 1 / (1 / tau + 5 / sigma)
theta <- rnorm(6, theta_hat, sqrt(var_theta))
hierarchical_samples[i + 1, ] <- c(mu, tau, sigma, theta)
}
par(mfrow = c(2, 3), mar = c(3, 3, 1, 1))
for (i in 1:6) {
hist(hierarchical_samples[, i + 3], main = NULL,
axes = FALSE, xlab = NULL, ylab = NULL,
col = 'steelblue', border = 'white')
axis(1, tick = FALSE)
axis(2, tick = FALSE, las = 1)
box(bty = 'L', col = 'grey60')
}
# Plot sds instead of variances
hierarchical_samples[, 2:3] <- sqrt(hierarchical_samples[, 2:3])
par(mfrow = c(1, 3), mar = c(3, 3, 1, 1))
for (i in 1:3) {
hist(hierarchical_samples[, i], main = NULL,
axes = FALSE, xlab = NULL, ylab = NULL,
col = 'steelblue', border = 'white')
axis(1, tick = FALSE)
axis(2, tick = FALSE, las = 1)
box(bty = 'L', col = 'grey60')
}
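Parts (ii) and (iii) of the question follow directly from these posterior draws: a new measurement on machine 6 is \(\tilde y_6 \sim \text{N}(\theta_6, \sigma^2)\), and the mean of a seventh machine is a fresh draw from the population distribution, \(\theta_7 \sim \text{N}(\mu, \tau^2)\). A sketch, with `hs` standing in for `hierarchical_samples` after the sd conversion above (the stand-in values here are arbitrary, used only so the snippet runs on its own):

```r
set.seed(5)
# Stand-in for hierarchical_samples: columns mu, tau, sigma, theta_1..theta_6,
# with tau and sigma already on the sd scale
hs <- cbind(rnorm(1000, 93, 5), runif(1000, 5, 15), runif(1000, 10, 20),
            matrix(rnorm(6000, 93, 10), ncol = 6))
y6_new <- rnorm(nrow(hs), hs[, 9], hs[, 3])  # (ii) predictive for machine 6
theta7 <- rnorm(nrow(hs), hs[, 1], hs[, 2])  # (iii) mean of a seventh machine
quantile(y6_new, c(0.025, 0.975))
quantile(theta7, c(0.025, 0.975))
```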
Let:
\[p(\mu, \sigma^2 ) \propto \frac{1}{\sigma^2} \qquad \text{(i.e. a uniform prior on } (\mu, \log \sigma) \text{)}\] and:
\[p(\mathbf{y} \lvert \mu, \sigma^2) \propto \frac{1}{\sigma^n} \exp \bigg( -\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - \mu)^2 \bigg) \]
(i.e. the classic normal model with independent observations)
Then:
\[p(\mu, \sigma^2 \lvert \mathbf{y}) \propto \sigma^{-n - 2} \exp \bigg( -\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - \mu)^2 \bigg)\] \[= \sigma^{-n - 2} \exp \bigg( -\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - \bar y + \bar y - \mu)^2 \bigg)\] \[= \sigma^{-n - 2} \exp \bigg( -\frac{1}{2\sigma^2} \bigg[ \sum_{i=1}^n (y_i - \bar y)^2 + n(\bar y - \mu)^2 \bigg] \bigg)\] \[= \sigma^{-n - 2} \exp \bigg( -\frac{1}{2\sigma^2} [(n-1)s^2 + n(\bar y - \mu)^2] \bigg)\]
Where \(\bar y\) and \(s^2\) are the sample mean and variance, respectively (the cross term \(2 \sum_i (y_i - \bar y)(\bar y - \mu)\) vanishes). Now:
\[p(\mu \lvert \mathbf{y}) = \int p(\mu, \sigma^2 \lvert \mathbf{y}) \, d\sigma^2\] \[= \int_0^\infty \sigma^{-n - 2} \exp \bigg( -\frac{1}{2\sigma^2} [(n-1)s^2 + n(\bar y - \mu)^2] \bigg) d\sigma^2\]
Let \(A = [(n-1)s^2 + n(\bar y - \mu)^2]\) and \(z = \frac{A}{2 \sigma^2}\). Then:
\[\frac{1}{2 \sigma^2} = \frac{z}{A} = zA^{-1}; \qquad \frac{1}{\sigma} \propto z^{\frac{1}{2}}A^{-\frac{1}{2}}; \qquad \sigma^2 = \frac{A}{2z} \implies \frac{\partial \sigma^2}{\partial z} \propto z^{-2} A\]
Thus:
\[p(\mu \lvert \mathbf{y}) \propto \int_0^\infty (z^{\frac{1}{2}} A^{-\frac{1}{2}})^{n + 2} \exp(-z) \cdot z^{-2} A \; dz\] \[= A^{-\frac{n}{2}} \int_0^\infty z^{\frac{n}{2} - 1} \exp(-z) dz\] \[= A^{-\frac{n}{2}} \cdot c \qquad \text{(recognizing the above as a Gamma integral)}\] \[\propto [(n - 1)s^2 + n(\bar y - \mu)^2]^{-\frac{n}{2}}\] \[\propto \bigg[1 + \frac{n(\bar y - \mu)^2}{(n-1)s^2} \bigg]^{-\frac{n}{2}}\]
Therefore:
\[\mu \lvert \mathbf{y} \sim t_{n-1} \bigg(\bar y, \frac{s^2}{n} \bigg)\]
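The result can also be verified by simulation: drawing \(\sigma^2 \lvert \mathbf{y} \sim \text{Inv-}\chi^2(n-1, s^2)\) and then \(\mu \lvert \sigma^2, \mathbf{y} \sim \text{N}(\bar y, \sigma^2/n)\), the standardized draws \((\mu - \bar y)/(s/\sqrt{n})\) should follow a standard \(t_{n-1}\) distribution (the data below are arbitrary):

```r
set.seed(6)
yv <- rnorm(20, 10, 3)
m <- length(yv); ybar <- mean(yv); s2 <- var(yv)
sig2 <- (m - 1) * s2 / rchisq(1e5, m - 1)     # sigma^2 | y
mu_draws <- rnorm(1e5, ybar, sqrt(sig2 / m))  # mu | sigma^2, y
t_draws <- (mu_draws - ybar) / sqrt(s2 / m)
ks.test(t_draws, pt, df = m - 1)$statistic    # small: the draws match t_{m-1}
```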